
Directed Information, Causal Estimation, and 
Communication in Continuous Time 

Tsachy Weissman, Young-Han Kim and Haim H. Permuter 

Abstract 

A notion of directed information between two continuous-time processes is proposed. A key component in the 
definition is taking an infimum over all possible partitions of the time interval, which plays a role no less significant 
than the supremum over "space" partitions inherent in the definition of mutual information. Properties and operational 
interpretations in estimation and communication are then established for the proposed notion of directed information. 
For the continuous-time additive white Gaussian noise channel, it is shown that Duncan's classical relationship between 
causal estimation and information continues to hold in the presence of feedback upon replacing mutual information 
by directed information. A parallel result is established for the Poisson channel. The utility of this relationship is then 
demonstrated in computing the directed information rate between the input and output processes of a continuous-time 
Poisson channel with feedback, where the channel input process is constrained to be constant between events at the 
channel output. Finally, the capacity of a wide class of continuous-time channels with feedback is established via 
directed information, characterizing the fundamental limit on reliable communication. 

Index Terms 

Causal estimation, continuous time, directed information, Duncan's theorem, feedback capacity, Gaussian channel, 
Poisson channel, time partition. 

I. Introduction 

The directed information $I(X^n \to Y^n)$ between two random $n$-sequences $X^n = (X_1, \ldots, X_n)$ and $Y^n = (Y_1, \ldots, Y_n)$ is a natural generalization of Shannon's mutual information to random objects obeying causal relations. Introduced by Massey [1], this notion has been shown to arise as the canonical answer to a variety of problems with causally dependent components. For example, it plays a pivotal role in characterizing the capacity $C_{\mathrm{FB}}$ of a communication channel with feedback. Massey [1] showed that the feedback capacity is upper bounded as

$$C_{\mathrm{FB}} \le \lim_{n \to \infty} \max_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n),$$

where $I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1})$ and $p(x^n \| y^{n-1}) = \prod_{i=1}^{n} p(x_i | x^{i-1}, y^{i-1})$; see also [2]. This upper bound is tight for certain classes of ergodic channels [3]-[5], paving the road to a computable characterization of feedback capacity; see [6]-[8] for examples.

Directed information and its variants also characterize (via multiletter expressions) the capacity for two-way channels, multiple access channels with feedback [2], [9], broadcast channels with feedback [10], and compound channels with feedback [11], as well as the rate-distortion function with feedforward [12], [13]. In another context, directed information captures the difference in growth rates of wealth in horse race gambling due to causal side information [14]. This provides a natural interpretation of $I(X^n \to Y^n)$ as the amount of information about $Y^n$ causally provided by $X^n$ on the fly. Similar interpretations for directed information can be drawn for other problems in science and engineering [15].

This paper is dedicated to extending the mathematical notion of directed information to continuous-time random 
processes, and to establishing results that demonstrate the operational significance of this notion in estimation and 
communication. Our contributions include the following: 

• We introduce the notion of directed information in continuous time. Given a pair of continuous-time processes 
in a time interval and its partition consisting of n subintervals, we first consider the (discrete-time) directed 
information for the two sequences of length n whose components are the sample paths on the respective 
subintervals. The resulting quantity depends on the specific partition of the time interval, and we define directed 
information in continuous time by taking the infimum over all finite time partitions. Thus, in contrast to mutual 
information in continuous time which can be defined as a supremum of mutual information over finite "space" 
partitions [16, Ch. 2.5], [17], inherent to our notion of directed information is a similar supremum followed by 
an infimum over time partitions. We explain why this definition is natural by showing that the continuous-time 
directed information inherits key properties of its discrete-time origin and establishing new properties that are 
meaningful in continuous time. 

• We show that this notion of directed information arises in extending classical relationships between information and estimation in continuous time to scenarios in which the channel input process can causally depend on the channel output process. These relationships are Duncan's theorem [18], which relates the minimum mean squared error (MMSE) in causal estimation of a target signal based on an observation through an additive white Gaussian noise channel to the information between the target signal and the observation, and its counterpart for the Poisson channel.

• We illustrate these relationships between directed information and estimation by characterizing the directed 
information rate and the feedback capacity of a continuous-time Poisson channel with inputs constrained to 
constancy between events at the channel output. 

• We establish the fundamental role of continuous-time directed information in characterizing the feedback 
capacity of a large class of continuous-time channels. In particular, we show that for channels where the 
output is a function of the input and some stationary ergodic "noise" process, the continuous-time directed 
information characterizes the feedback capacity of the channel. 

The remainder of the paper is organized as follows. Section II is devoted to the definition of directed information and related quantities in continuous time, which is followed by a presentation of key properties of continuous-time directed information in Section III. In Section IV we establish the generalizations of Duncan's theorem and its Poisson counterpart that accommodate the presence of feedback. In Section V we apply the relationship between the causal estimation error and directed information for the Poisson channel to compute the directed information rate between the input and the output of this channel in a scenario that involves feedback. In Section VI we study a general feedback communication problem in which our notion of directed information in continuous time emerges naturally in the characterization of the feedback capacity. Section VII concludes the paper with a few remarks.



II. Definition and Representation of Directed Information in Continuous Time 

Let $P$ and $Q$ be two probability measures on the same space and $\frac{dP}{dQ}$ be the Radon-Nikodym derivative of $P$ with respect to $Q$. The relative entropy between $P$ and $Q$ is defined as

$$D(P \| Q) := \begin{cases} \int \log \frac{dP}{dQ} \, dP & \text{if } \frac{dP}{dQ} \text{ exists,} \\ \infty & \text{otherwise.} \end{cases} \tag{1}$$

For jointly distributed random objects $U$ and $V$, the mutual information between them is defined as

$$I(U;V) := D(P_{U,V} \| P_U \times P_V), \tag{2}$$

where $P_U \times P_V$ denotes the product distribution under which $U$ and $V$ are independent, but maintain their respective marginal distributions. We write $I(P_{U,V})$ instead of $I(U;V)$ when we wish to emphasize the dependence on the joint distribution $P_{U,V}$. For a jointly distributed triple $(U, V, W)$, the conditional mutual information between $U$ and $V$ given $W$ is defined as

$$I(U;V|W) := \int I(P_{U,V|W=w}) \, dP_W(w), \tag{3}$$

where $P_{U,V|W=w}$ is a regular version of the conditional probability law of $(U, V)$ given $\{W = w\}$. We note that $U, V, W$ in (2) and (3) are random objects that can take values in an arbitrary measurable space. In this paper, these objects will most commonly be either random variables or continuous-time stochastic processes.

An alternative approach [16, Ch. 2.5] to defining the mutual information (and, subsequently, the conditional mutual information) when $U$ and $V$ take values in general abstract alphabets $\mathcal{U}$ and $\mathcal{V}$, respectively, is to define

$$I(U;V) := \sup I([U];[V]), \tag{4}$$

where the supremum is over all finite partitions (quantizations) of $\mathcal{U}$ and $\mathcal{V}$. That the two notions coincide has been established in, e.g., [17], [19].

Let $(X^n, Y^n)$ be a pair of random $n$-sequences. The directed information from $X^n$ to $Y^n$ is defined as

$$I(X^n \to Y^n) := \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}). \tag{5}$$

Note that, unlike mutual information, directed information is asymmetric in its arguments, i.e., $I(X^n \to Y^n) \ne I(Y^n \to X^n)$ in general.

For a continuous-time process $\{X_t\}$, let $X_a^b = \{X_s : a \le s < b\}$ denote the process in the time interval $[a, b)$. Throughout this section, equalities and inequalities between random objects, unless explicitly indicated otherwise, are to be understood to hold for all sample paths (i.e., in the sure sense). Functions of random objects are assumed to be measurable even though not explicitly indicated so.
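Before moving to continuous time, the discrete-time definition (5) can be evaluated numerically for small alphabets. The following is a minimal Python sketch, not part of the original development: the helper names `cmi` and `directed_info`, the toy binary channel, and the crossover probability `eps = 0.1` are all illustrative assumptions. The second input simply repeats the first output, so the example also shows the asymmetry noted after (5).

```python
from collections import defaultdict
from itertools import product
from math import log2


def cmi(pmf, fa, fb, fc):
    """I(A;B|C) in bits for A=fa(w), B=fb(w), C=fc(w), with w drawn from pmf."""
    pabc, pac, pbc, pc = (defaultdict(float) for _ in range(4))
    for w, p in pmf.items():
        a, b, c = fa(w), fb(w), fc(w)
        pabc[a, b, c] += p
        pac[a, c] += p
        pbc[b, c] += p
        pc[c] += p
    return sum(p * log2(p * pc[c] / (pac[a, c] * pbc[b, c]))
               for (a, b, c), p in pabc.items() if p > 0)


def directed_info(pmf, n):
    """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1}); outcomes are (x_tuple, y_tuple)."""
    return sum(cmi(pmf,
                   lambda w, i=i: w[0][:i],       # X^i
                   lambda w, i=i: w[1][i - 1],    # Y_i
                   lambda w, i=i: w[1][:i - 1])   # Y^{i-1}
               for i in range(1, n + 1))


eps = 0.1  # crossover probability of the toy channel (illustrative value)
pmf = defaultdict(float)
for x1, z1, z2 in product((0, 1), repeat=3):
    y1 = x1 ^ z1
    x2 = y1            # feedback: the second input repeats the first output
    y2 = x2 ^ z2
    p = 0.5 * (eps if z1 else 1 - eps) * (eps if z2 else 1 - eps)
    pmf[(x1, x2), (y1, y2)] += p

forward = directed_info(pmf, 2)                       # I(X^2 -> Y^2)
swapped = {(y, x): p for (x, y), p in pmf.items()}
backward = directed_info(swapped, 2)                  # I(Y^2 -> X^2)
print(forward, backward)
```

In this toy model the forward quantity equals $1 - h(0.1)$ bits, where $h$ is the binary entropy function, while the backward quantity equals 1 bit.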

We now develop the notion of directed information between two continuous-time stochastic processes on the time interval $[0, T)$. Let $\mathbf{t} = (t_0, t_1, \ldots, t_n)$ denote a vector with components satisfying

$$0 = t_0 < t_1 < \cdots < t_n = T. \tag{6}$$

Let $X_0^{T,\mathbf{t}}$ denote the sequence of length $n$ resulting from "chopping up" the continuous-time signal $X_0^T$ into consecutive segments as

$$X_0^{T,\mathbf{t}} = \left( X_0^{t_1}, X_{t_1}^{t_2}, \ldots, X_{t_{n-1}}^{T} \right). \tag{7}$$

Note that each component of the sequence is a continuous-time stochastic process. For a pair of jointly distributed stochastic processes $(X_0^T, Y_0^T)$, define

$$I_{\mathbf{t}}(X_0^T \to Y_0^T) := I(X_0^{T,\mathbf{t}} \to Y_0^{T,\mathbf{t}}) \tag{8}$$

$$= \sum_{i=1}^{n} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}), \tag{9}$$

where the right side of (8) is the directed information between two sequences of length $n$ defined in (5); and in (9) we note that the conditional mutual information terms are between two continuous-time processes, conditioned on a third, as accommodated by the definition in (3).

The quantity $I_{\mathbf{t}}(X_0^T \to Y_0^T)$ is monotone in $\mathbf{t}$ in the following sense:

Proposition 1. If $\mathbf{t}'$ is a refinement of $\mathbf{t}$, i.e., $\{t_i\} \subset \{t'_i\}$, then $I_{\mathbf{t}'}(X_0^T \to Y_0^T) \le I_{\mathbf{t}}(X_0^T \to Y_0^T)$.

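Proof: It suffices to prove the claim assuming $\mathbf{t}$ as in (6) and that $\mathbf{t}'$ is the $(n+2)$-dimensional vector with components

$$0 = t_0 < t_1 < \cdots < t_{i-1} < t' < t_i < \cdots < t_n = T. \tag{10}$$

For such $\mathbf{t}$ and $\mathbf{t}'$, we have from (9)

\begin{align}
& I_{\mathbf{t}}(X_0^T \to Y_0^T) - I_{\mathbf{t}'}(X_0^T \to Y_0^T) \tag{11} \\
&= I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) - \left[ I(Y_{t_{i-1}}^{t'}; X_0^{t'} | Y_0^{t_{i-1}}) + I(Y_{t'}^{t_i}; X_0^{t_i} | Y_0^{t'}) \right] \tag{12} \\
&= I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) - \left[ I(Y_{t_{i-1}}^{t'}; X_0^{t'} | Y_0^{t_{i-1}}) + I(Y_{t'}^{t_i}; X_0^{t_i} | Y_0^{t'}) \right] \tag{13} \\
&= I(X_0^{t'}, X_{t'}^{t_i}; Y_{t_{i-1}}^{t'}, Y_{t'}^{t_i} | Y_0^{t_{i-1}}) - \left[ I(Y_{t_{i-1}}^{t'}; X_0^{t'} | Y_0^{t_{i-1}}) + I(Y_{t'}^{t_i}; X_0^{t'}, X_{t'}^{t_i} | Y_0^{t_{i-1}}, Y_{t_{i-1}}^{t'}) \right] \tag{14} \\
&= I(X_0^{t'}, X_{t'}^{t_i}; Y_{t_{i-1}}^{t'}, Y_{t'}^{t_i} | Y_0^{t_{i-1}}) - I\left( (X_0^{t'}, X_{t'}^{t_i}) \to (Y_{t_{i-1}}^{t'}, Y_{t'}^{t_i}) \,\middle|\, Y_0^{t_{i-1}} \right) \tag{15} \\
&\ge 0, \tag{16}
\end{align}

where the last inequality follows since directed information (between two sequences of length 2 in this case) is upper bounded by the mutual information [1, Thm 2]. ■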
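The final step of the proof, that directed information between two length-2 sequences cannot exceed their mutual information, can be checked numerically in discrete time. The sketch below is illustrative only: the helper names, the random joint law, and the seed are assumptions introduced here. The "coarse" value treats each length-2 sequence as a single block (mutual information), and the "fine" value splits it into two blocks (directed information).

```python
import random
from collections import defaultdict
from itertools import product
from math import log2


def cmi(pmf, fa, fb, fc):
    """I(A;B|C) in bits for A=fa(w), B=fb(w), C=fc(w), with w drawn from pmf."""
    pabc, pac, pbc, pc = (defaultdict(float) for _ in range(4))
    for w, p in pmf.items():
        a, b, c = fa(w), fb(w), fc(w)
        pabc[a, b, c] += p
        pac[a, c] += p
        pbc[b, c] += p
        pc[c] += p
    return sum(p * log2(p * pc[c] / (pac[a, c] * pbc[b, c]))
               for (a, b, c), p in pabc.items() if p > 0)


def directed_info(pmf, n):
    """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1}); outcomes are (x_tuple, y_tuple)."""
    return sum(cmi(pmf,
                   lambda w, i=i: w[0][:i],
                   lambda w, i=i: w[1][i - 1],
                   lambda w, i=i: w[1][:i - 1])
               for i in range(1, n + 1))


# Arbitrary joint law on binary pairs (X^2, Y^2); the seed is an arbitrary choice.
random.seed(0)
outcomes = [(x, y) for x in product((0, 1), repeat=2)
            for y in product((0, 1), repeat=2)]
weights = [random.random() for _ in outcomes]
total = sum(weights)
pmf = {w: v / total for w, v in zip(outcomes, weights)}

coarse = cmi(pmf, lambda w: w[0], lambda w: w[1], lambda w: ())  # I(X^2; Y^2)
fine = directed_info(pmf, 2)                                      # I(X^2 -> Y^2)
print(fine, coarse)
```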
The following definition is now natural: 

Definition 1. Let $(X_0^T, Y_0^T)$ be a pair of jointly distributed stochastic processes and $\mathcal{T}(0, T)$ be the set of all finite partitions of the time interval $[0, T)$. The directed information from $X_0^T$ to $Y_0^T$ is defined as

$$I(X_0^T \to Y_0^T) := \inf_{\mathbf{t} \in \mathcal{T}(0,T)} I_{\mathbf{t}}(X_0^T \to Y_0^T). \tag{17}$$

Note that the definitions and conventions preceding Definition 1 imply that the directed information $I(X_0^T \to Y_0^T)$ is always well-defined as a nonnegative extended real number (i.e., as an element of $[0, \infty]$). It is also worth noting, by recalling (4), that each of the conditional mutual informations in (9), and hence the sum, is a supremum over "space" partitions of the stochastic process in the corresponding time intervals. Thus the directed information in (17) is an infimum over time partitions of a supremum over space partitions. Note further, in light of Proposition 1, that

$$I(X_0^T \to Y_0^T) = \lim_{\varepsilon \to 0^+} \inf_{\{\mathbf{t} : \, t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(X_0^T \to Y_0^T). \tag{18}$$

We extend the notion of directed information to define conditional directed information $I(X_0^T \to Y_0^T | V)$, where $V \sim F(v)$ is a random object jointly distributed with $(X_0^T, Y_0^T)$, as

$$I(X_0^T \to Y_0^T | V) := \int I(X_0^T \to Y_0^T | V = v) \, dF(v), \tag{19}$$

where $I(X_0^T \to Y_0^T | V = v)$ on the right hand side of (19) denotes the directed information, as already defined in Definition 1, when the pair $(X_0^T, Y_0^T)$ is jointly distributed according to (a regular version of) the conditional distribution given $\{V = v\}$.

As is clear from its definition in (5), the discrete-time directed information satisfies

$$I(X^n \to Y^n) - I(X^{n-1} \to Y^{n-1}) = I(Y_n; X^n | Y^{n-1}). \tag{20}$$

A continuous-time analogue would be that, for small $\delta > 0$,

$$I(X_0^{t+\delta} \to Y_0^{t+\delta}) - I(X_0^t \to Y_0^t) \approx I(Y_t^{t+\delta}; X_0^{t+\delta} | Y_0^t). \tag{21}$$

Thus, if our proposed notion of directed information in continuous time is to be a natural extension of that in discrete time, one might expect the approximate relation (21) to hold in some sense. Toward a precise statement, denote

$$i_t := \lim_{\delta \to 0^+} \frac{1}{\delta} I(Y_t^{t+\delta}; X_0^{t+\delta} | Y_0^t) \quad \text{for } t \in (0, T) \tag{22}$$

whenever the limit exists. Assuming $i_t$ exists, let

$$\eta(t, \delta) := \frac{1}{\delta} I(Y_t^{t+\delta}; X_0^{t+\delta} | Y_0^t) - i_t \tag{23}$$

and note that (22) is equivalent to

$$\lim_{\delta \to 0^+} \eta(t, \delta) = 0. \tag{24}$$

Proposition 2. Fix $0 < t < T$. Suppose that $i_t$ is continuous at $t$ and that the convergence in (24) is uniform in a neighborhood of $t$. Then

$$\frac{d^+}{dt} I(X_0^t \to Y_0^t) = i_t. \tag{25}$$

Note that Proposition 2 formalizes (21) by implying that the left and right hand sides of (21), when normalized by $\delta$, coincide in the limit of small $\delta$.

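Proof of Proposition 2: Note first that the stipulated uniform convergence in (24) implies the existence of $\gamma > 0$ and a monotone function $f(\delta)$ such that

$$|\eta(t', \delta)| \le f(\delta) \quad \text{for all } t' \in [t, t+\gamma) \tag{26}$$

and

$$\lim_{\delta \to 0^+} f(\delta) = 0. \tag{27}$$

Fix now $0 < \varepsilon < \gamma$ and consider

\begin{align}
I(X_0^{t+\varepsilon} \to Y_0^{t+\varepsilon})
&= \inf_{\mathbf{t} \in \mathcal{T}(0, t+\varepsilon)} I_{\mathbf{t}}(X_0^{t+\varepsilon} \to Y_0^{t+\varepsilon}) \tag{28} \\
&= \inf_{\mathbf{t} \in \mathcal{T}(0, t+\varepsilon)} \sum_{i} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) \tag{29} \\
&= \inf_{\mathbf{t} \in \mathcal{T}(0, t+\varepsilon)} \left[ \sum_{i : \, t_i \in [0, t)} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) + \sum_{i : \, t_i \in [t, t+\varepsilon)} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) \right] \tag{30} \\
&= \inf_{\mathbf{t} \in \mathcal{T}(0, t)} \sum_{i} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) + \inf_{\mathbf{t} \in \mathcal{T}(t, t+\varepsilon)} \sum_{i} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) \tag{31} \\
&= I(X_0^t \to Y_0^t) + \inf_{\mathbf{t} \in \mathcal{T}(t, t+\varepsilon)} \sum_{i=1}^{n} (t_i - t_{i-1}) \cdot \frac{1}{t_i - t_{i-1}} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) \tag{32} \\
&= I(X_0^t \to Y_0^t) + \inf_{\mathbf{t} \in \mathcal{T}(t, t+\varepsilon)} \sum_{i=1}^{n} (t_i - t_{i-1}) \cdot \left[ i_{t_{i-1}} + \eta(t_{i-1}, t_i - t_{i-1}) \right], \tag{33}
\end{align}

where $\mathcal{T}(a, b)$ denotes the set of all finite partitions of the time interval $[a, b)$ and the last equality follows by the definition of the function $\eta$ in (23). Now,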



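\begin{align}
\inf_{\mathbf{t} \in \mathcal{T}(t, t+\varepsilon)} \sum_{i=1}^{n} (t_i - t_{i-1}) \cdot \left[ i_{t_{i-1}} + \eta(t_{i-1}, t_i - t_{i-1}) \right]
&\le \inf_{\mathbf{t} \in \mathcal{T}(t, t+\varepsilon)} \sum_{i=1}^{n} (t_i - t_{i-1}) \cdot \left[ \sup_{t' \in [t, t+\varepsilon)} i_{t'} + f(\varepsilon) \right] \tag{34} \\
&= \varepsilon \cdot \left[ \sup_{t' \in [t, t+\varepsilon)} i_{t'} + f(\varepsilon) \right], \tag{35}
\end{align}

where the inequality in (34) is due to (26) and the monotonicity of $f$, which implies $f(t_i - t_{i-1}) \le f(\varepsilon)$, as $t_i - t_{i-1}$ is the length of a subinterval in $[t, t+\varepsilon)$. Bounding the $\eta$ terms in (34) from the other direction, we similarly obtain

$$\inf_{\mathbf{t} \in \mathcal{T}(t, t+\varepsilon)} \sum_{i=1}^{n} (t_i - t_{i-1}) \cdot \left[ i_{t_{i-1}} + \eta(t_{i-1}, t_i - t_{i-1}) \right] \ge \varepsilon \cdot \left[ \inf_{t' \in [t, t+\varepsilon)} i_{t'} - f(\varepsilon) \right]. \tag{36}$$

Combining (33), (35), and (36) yields

$$\inf_{t' \in [t, t+\varepsilon)} i_{t'} - f(\varepsilon) \le \frac{I(X_0^{t+\varepsilon} \to Y_0^{t+\varepsilon}) - I(X_0^t \to Y_0^t)}{\varepsilon} \le \sup_{t' \in [t, t+\varepsilon)} i_{t'} + f(\varepsilon) \quad \text{for all } \varepsilon > 0. \tag{37}$$

The continuity of $i_t$ at $t$ implies $\lim_{\varepsilon \to 0^+} \inf_{t' \in [t, t+\varepsilon)} i_{t'} = \lim_{\varepsilon \to 0^+} \sup_{t' \in [t, t+\varepsilon)} i_{t'} = i_t$ and thus, taking the limit $\varepsilon \to 0^+$ in (37) and applying (27) finally yields

$$\lim_{\varepsilon \to 0^+} \frac{I(X_0^{t+\varepsilon} \to Y_0^{t+\varepsilon}) - I(X_0^t \to Y_0^t)}{\varepsilon} = i_t, \tag{38}$$

which completes the proof of Proposition 2. ■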

Beyond the intuitive appeal of Proposition 2 in formalizing (21), it also provides a useful formula for computing directed information. Indeed, the integral version of (25) is

$$I(X_0^T \to Y_0^T) = \int_0^T i_t \, dt. \tag{39}$$

As the following example illustrates, evaluating the right hand side of (39) (via the definition of $i_t$ in (22)) can be simpler than tackling the left hand side directly via Definition 1.

Example 1. Let $\{B_t\}$ be a standard Brownian motion and $A \sim N(0, 1)$ be independent of $\{B_t\}$. Let $X_t = A$ for all $t$ and $dY_t = X_t \, dt + dB_t$. Letting $J(P, N) = (1/2) \ln((P + N)/N)$ denote the mutual information between a Gaussian random variable of variance $P$ and its corrupted version by an independent Gaussian noise of variance $N$, we have for every $t \in [0, T)$

$$I(Y_t^{t+\delta}; X_0^{t+\delta} | Y_0^t) = J\left( \frac{1}{1+t}, \frac{1}{\delta} \right) = \frac{1}{2} \ln\left( 1 + \frac{\delta}{t+1} \right).$$

Evidently,

$$i_t = \lim_{\delta \to 0^+} \frac{1}{2\delta} \ln\left( 1 + \frac{\delta}{t+1} \right) = \frac{1}{2(t+1)}. \tag{40}$$

We can now compute the directed information by applying Proposition 2:

$$I(X_0^T \to Y_0^T) = \int_0^T i_t \, dt = \int_0^T \frac{1}{2(t+1)} \, dt = \frac{1}{2} \ln(1 + T). \tag{41}$$

Note that in this example $I(X_0^T; Y_0^T) = J(1, 1/T) = \frac{1}{2} \ln(1 + T)$ and thus, by (41), we have $I(X_0^T \to Y_0^T) = I(X_0^T; Y_0^T)$. This equality between mutual information and directed information holds in more general situations, as elaborated in the next section.
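The computation in Example 1 can be sanity-checked numerically. The sketch below is illustrative: the function names, the horizon `T = 3.0`, the step size, and the midpoint quadrature rule are assumptions made here. It compares the normalized increment from (22) against the closed form in (40), and the integral in (39) against $\frac{1}{2}\ln(1+T)$, all in nats.

```python
from math import log, log1p


def rate(t, delta):
    """(1/delta) * I(Y_t^{t+delta}; X_0^{t+delta} | Y_0^t) for Example 1, in nats."""
    return 0.5 * log1p(delta / (t + 1.0)) / delta


def i_t(t):
    """Closed-form limit from (40)."""
    return 1.0 / (2.0 * (t + 1.0))


def midpoint_integral(f, a, b, steps=100_000):
    """Plain midpoint rule; accurate enough for this smooth integrand."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


T = 3.0                                     # illustrative horizon
directed = midpoint_integral(i_t, 0.0, T)   # right side of (39)
mutual = 0.5 * log(1.0 + T)                 # J(1, 1/T)
print(directed, mutual, rate(1.0, 1e-6), i_t(1.0))
```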

The directed information we have just defined is between two processes on $[0, T)$. We extend this definition to processes of different durations by zero-padding at the beginning of the shorter process. For instance,

$$I(X_0^{T-\delta} \to Y_0^T) := I\left( (0_0^{\delta} X_0^{T-\delta}) \to Y_0^T \right), \tag{42}$$

where $(0_0^{\delta} X_0^{T-\delta})$ denotes a process on $[0, T)$ formed by concatenating a process that is equal to the constant 0 for the time interval $[0, \delta)$ and then the process $X_0^{T-\delta}$.

Define now

$$\overline{I}(X_0^{T-} \to Y_0^T) := \limsup_{\delta \to 0^+} I(X_0^{T-\delta} \to Y_0^T) \tag{43}$$

and

$$\underline{I}(X_0^{T-} \to Y_0^T) := \liminf_{\delta \to 0^+} I(X_0^{T-\delta} \to Y_0^T). \tag{44}$$

Finally, define the directed information $I(X_0^{T-} \to Y_0^T)$ by

$$I(X_0^{T-} \to Y_0^T) := \lim_{\delta \to 0^+} I(X_0^{T-\delta} \to Y_0^T) \tag{45}$$

when the limit exists, or equivalently, when $\overline{I}(X_0^{T-} \to Y_0^T) = \underline{I}(X_0^{T-} \to Y_0^T)$.

III. Properties of the Directed Information in Continuous Time 
The following proposition collects some properties of directed information in continuous time: 

Proposition 3. Let $(X_0^T, Y_0^T)$ be a pair of jointly distributed stochastic processes. Then:

1) Monotonicity: $I(X_0^t \to Y_0^t)$ is monotone nondecreasing in $0 \le t \le T$.

2) Invariance to time dilation: For $\alpha > 0$, if $\tilde{X}_t = X_{t\alpha}$ and $\tilde{Y}_t = Y_{t\alpha}$, then $I(\tilde{X}_0^{T/\alpha} \to \tilde{Y}_0^{T/\alpha}) = I(X_0^T \to Y_0^T)$. More generally, if $\phi$ is monotone strictly increasing and continuous, and $(\tilde{X}_{\phi(t)}, \tilde{Y}_{\phi(t)}) = (X_t, Y_t)$, then

$$I(X_0^T \to Y_0^T) = I(\tilde{X}_{\phi(0)}^{\phi(T)} \to \tilde{Y}_{\phi(0)}^{\phi(T)}). \tag{46}$$

3) Coincidence of directed information and mutual information: If the Markov relation $Y_0^t \to X_0^t \to X_t^T$ holds for all $0 \le t < T$, then

$$I(X_0^T \to Y_0^T) = I(X_0^T; Y_0^T). \tag{47}$$

4) Equivalence between discrete time and piecewise constancy in continuous time: Let $(U^n, V^n)$ be a pair of jointly distributed $n$-tuples and suppose $(t_0, t_1, \ldots, t_n)$ satisfy (6). Let the pair $(X_0^T, Y_0^T)$ be defined as the piecewise-constant process satisfying

$$(X_t, Y_t) = (U_i, V_i) \quad \text{if } t_{i-1} \le t < t_i \tag{48}$$

for $i = 1, \ldots, n$. Then

$$I(X_0^T \to Y_0^T) = I(U^n \to V^n). \tag{49}$$

5) Conservation law: For any $0 < \delta \le T$ we have

$$I(X_0^{\delta}; Y_0^{\delta}) + I(X_{\delta}^T \to Y_{\delta}^T | Y_0^{\delta}) + I(Y_0^{T-\delta} \to X_0^T) = I(X_0^T; Y_0^T). \tag{50}$$

In particular,

a) $$\limsup_{\delta \to 0^+} \left[ I(X_0^{\delta}; Y_0^{\delta}) + I(X_{\delta}^T \to Y_{\delta}^T | Y_0^{\delta}) \right] = I(X_0^T; Y_0^T) - \underline{I}(Y_0^{T-} \to X_0^T). \tag{51}$$

b) $$\liminf_{\delta \to 0^+} \left[ I(X_0^{\delta}; Y_0^{\delta}) + I(X_{\delta}^T \to Y_{\delta}^T | Y_0^{\delta}) \right] = I(X_0^T; Y_0^T) - \overline{I}(Y_0^{T-} \to X_0^T). \tag{52}$$

c) If the continuity condition

$$\lim_{\delta \to 0^+} \left[ I(X_0^{\delta}; Y_0^{\delta}) + I(X_{\delta}^T \to Y_{\delta}^T | Y_0^{\delta}) \right] = I(X_0^T \to Y_0^T) \tag{53}$$

holds, then the directed information $I(Y_0^{T-} \to X_0^T)$ exists and

$$I(X_0^T \to Y_0^T) + I(Y_0^{T-} \to X_0^T) = I(X_0^T; Y_0^T). \tag{54}$$

Remarks. 

1) The first, second, and fourth parts in the proposition present properties that are known to hold for mutual information (when all the directed information expressions in those items are replaced by the corresponding mutual information), which follow immediately from the data processing inequality and the invariance of mutual information to one-to-one transformations of its arguments. That these properties hold also for directed information is not as obvious in view of the fact that directed information is, in general, not invariant to one-to-one transformations nor does it satisfy the data processing inequality in its second argument.

2) The third part of the proposition is a natural analogue of the fact that $I(X^n; Y^n) = I(X^n \to Y^n)$ whenever $Y^i \to X^i \to X_{i+1}^n$ form a Markov chain for all $1 \le i \le n$. It covers, in particular, any scenario where $X_0^T$ and $Y_0^T$ are the input and output of any channel of the form $Y_t = g_t(X_0^t, W_0^T)$, where the process $W_0^T$ (which can be thought of as the internal channel noise) is independent of the channel input process $X_0^T$. To see this, note that in this case we have $(X_0^t, W_0^T) \to X_0^t \to X_t^T$ for all $0 \le t < T$, implying $Y_0^t \to X_0^t \to X_t^T$ since $Y_0^t$ is determined by the pair $(X_0^t, W_0^T)$.

3) Particularizing even further, we obtain $I(X_0^T \to Y_0^T) = I(X_0^T; Y_0^T)$ whenever $Y_0^T$ is the outcome of corrupting $X_0^T$ with additive noise, i.e., $Y_t = X_t + W_t$, where $X_0^T$ and $W_0^T$ are independent.

4) The fifth part of the proposition can be considered the continuous-time analogue of the discrete-time conservation law

$$I(U^n \to V^n) + I(V^{n-1} \to U^n) = I(U^n; V^n). \tag{55}$$

It is consistent with, and in fact generalizes, the third part. Indeed, if the Markov relation $Y_0^t \to X_0^t \to X_t^T$ holds for all $0 \le t < T$ then our definition of directed information is readily seen to imply that $I(Y_0^{T-\delta} \to X_0^T) = 0$ for all $\delta > 0$ and therefore that $I(Y_0^{T-} \to X_0^T)$ exists and equals zero. Thus (54) in this case reduces to (47).
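The discrete-time conservation law (55) is easy to verify numerically for short sequences. In the sketch below the helper names, the sequence length `n = 3`, the random joint law, and the seed are illustrative assumptions; the delayed reverse term is computed as $I(V^{n-1} \to U^n) = \sum_{i=1}^{n} I(V^{i-1}; U_i | U^{i-1})$, i.e., with $V$ zero-padded at the beginning.

```python
import random
from collections import defaultdict
from itertools import product
from math import log2


def cmi(pmf, fa, fb, fc):
    """I(A;B|C) in bits for A=fa(w), B=fb(w), C=fc(w), with w drawn from pmf."""
    pabc, pac, pbc, pc = (defaultdict(float) for _ in range(4))
    for w, p in pmf.items():
        a, b, c = fa(w), fb(w), fc(w)
        pabc[a, b, c] += p
        pac[a, c] += p
        pbc[b, c] += p
        pc[c] += p
    return sum(p * log2(p * pc[c] / (pac[a, c] * pbc[b, c]))
               for (a, b, c), p in pabc.items() if p > 0)


n = 3
random.seed(1)  # arbitrary seed for a reproducible joint law of (U^n, V^n)
outcomes = [(u, v) for u in product((0, 1), repeat=n)
            for v in product((0, 1), repeat=n)]
weights = [random.random() for _ in outcomes]
total = sum(weights)
pmf = {w: x / total for w, x in zip(outcomes, weights)}

# I(U^n -> V^n) = sum_i I(U^i; V_i | V^{i-1})
forward = sum(cmi(pmf, lambda w, i=i: w[0][:i], lambda w, i=i: w[1][i - 1],
                  lambda w, i=i: w[1][:i - 1]) for i in range(1, n + 1))
# I(V^{n-1} -> U^n) = sum_i I(V^{i-1}; U_i | U^{i-1})
reverse = sum(cmi(pmf, lambda w, i=i: w[1][:i - 1], lambda w, i=i: w[0][i - 1],
                  lambda w, i=i: w[0][:i - 1]) for i in range(1, n + 1))
mutual = cmi(pmf, lambda w: w[0], lambda w: w[1], lambda w: ())
print(forward + reverse, mutual)
```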

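Proof of Proposition 3: The first part of the proposition follows immediately from the definition of directed information in continuous time (Definition 1) and from the fact that, in discrete time, $I(U^m \to V^m) \le I(U^n \to V^n)$ for $m \le n$. The second part follows from Definition 1 upon noting that, under a dilation $\phi$ as stipulated, due to the invariance of mutual information to one-to-one transformations of its arguments, for any partition $\mathbf{t}$ of $[0, T)$,

$$I_{\mathbf{t}}(X_0^T \to Y_0^T) = I_{\phi(\mathbf{t})}(\tilde{X}_{\phi(0)}^{\phi(T)} \to \tilde{Y}_{\phi(0)}^{\phi(T)}), \tag{56}$$

where $\phi(\mathbf{t})$ is shorthand for $(\phi(t_0), \phi(t_1), \ldots, \phi(t_n))$. Thus

\begin{align}
I(X_0^T \to Y_0^T)
&= \inf_{\mathbf{t} \in \mathcal{T}(0,T)} I_{\mathbf{t}}(X_0^T \to Y_0^T) \tag{57} \\
&= \inf_{\mathbf{t} \in \mathcal{T}(0,T)} I_{\phi(\mathbf{t})}(\tilde{X}_{\phi(0)}^{\phi(T)} \to \tilde{Y}_{\phi(0)}^{\phi(T)}) \tag{58} \\
&= \inf_{\mathbf{t} \in \mathcal{T}(\phi(0), \phi(T))} I_{\mathbf{t}}(\tilde{X}_{\phi(0)}^{\phi(T)} \to \tilde{Y}_{\phi(0)}^{\phi(T)}) \tag{59} \\
&= I(\tilde{X}_{\phi(0)}^{\phi(T)} \to \tilde{Y}_{\phi(0)}^{\phi(T)}), \tag{60}
\end{align}

where (57) and (60) follow from Definition 1, (58) follows from (56), and (59) is due to the strict monotonicity and continuity of $\phi$, which implies that

$$\{\phi(\mathbf{t}) : \mathbf{t} \text{ is a partition of } [0, T)\} = \{\mathbf{t} : \mathbf{t} \text{ is a partition of } [\phi(0), \phi(T))\}. \tag{61}$$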

Moving to the proof of the third part, assume that the Markov relation $Y_0^t \to X_0^t \to X_t^T$ holds for all $0 \le t < T$ and fix $\mathbf{t} = (t_0, t_1, \ldots, t_n)$ as in (6). Then

\begin{align}
I_{\mathbf{t}}(X_0^T \to Y_0^T)
&= I(X_0^{T,\mathbf{t}} \to Y_0^{T,\mathbf{t}}) \tag{62} \\
&= \sum_{i=1}^{n} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) \tag{63} \\
&= \sum_{i=1}^{n} I(Y_{t_{i-1}}^{t_i}; X_0^{T} | Y_0^{t_{i-1}}) \tag{64} \\
&= I(X_0^T; Y_0^T), \tag{65}
\end{align}

where (64) follows since $Y_0^{t_i} \to X_0^{t_i} \to X_{t_i}^T$ for each $1 \le i \le n$, and (65) is due to the chain rule for mutual information. The proof of the third part of the proposition now follows from the arbitrariness of $\mathbf{t}$.

To prove the fourth part, consider first the case $n = 1$. In this case $X_t = U_1$ and $Y_t = V_1$ for all $t \in [0, T)$. It is an immediate consequence of the definition of directed information that $I((U, U, \ldots, U) \to (V, V, \ldots, V)) = I(U; V)$ and therefore that $I_{\mathbf{t}}(X_0^T \to Y_0^T) = I(U_1; V_1) = I(U_1 \to V_1)$ for all $\mathbf{t}$. Consequently $I(X_0^T \to Y_0^T) = I(U_1 \to V_1)$, which establishes the case $n = 1$. For the general case $n > 1$, note first that it is immediate from the definition of $I_{\mathbf{t}}(X_0^T \to Y_0^T)$ and from the construction of $(X_0^T, Y_0^T)$ based on $(U^n, V^n)$ in (48) that for $\mathbf{t} = (t_0, t_1, \ldots, t_n)$ consisting of the time epochs in (48) we have $I_{\mathbf{t}}(X_0^T \to Y_0^T) = I(U^n \to V^n)$. Thus $I(X_0^T \to Y_0^T) \le I_{\mathbf{t}}(X_0^T \to Y_0^T) = I(U^n \to V^n)$. We now argue that

$$I_{\mathbf{s}}(X_0^T \to Y_0^T) \ge I(U^n \to V^n) \tag{66}$$

for any partition $\mathbf{s}$. By Proposition 1 it suffices to establish (66) with equality assuming $\mathbf{s}$ is a refinement of the particular $\mathbf{t}$ just discussed, that is, $\mathbf{s}$ is of the form

$$0 = t_0 = s_{0,0} < s_{0,1} < \cdots < s_{0,J_0} < t_1 = s_{1,0} < s_{1,1} < \cdots < s_{1,J_1} < t_2 = s_{2,0} < \cdots < s_{n-1,J_{n-1}} < t_n = T. \tag{67}$$

Then,

\begin{align}
I_{\mathbf{s}}(X_0^T \to Y_0^T)
&= I(X_0^{T,\mathbf{s}} \to Y_0^{T,\mathbf{s}}) \tag{68} \\
&= \sum_{i=1}^{n} \sum_{j} I(Y_{s_{i-1,j-1}}^{s_{i-1,j}}; X_0^{s_{i-1,j}} | Y_0^{s_{i-1,j-1}}) \tag{69} \\
&= \sum_{i=1}^{n} I(U^i; V_i | V^{i-1}) \tag{70} \\
&= I(U^n \to V^n), \tag{71}
\end{align}

where the inner sum in (69) runs over the subintervals of $\mathbf{s}$ contained in $[t_{i-1}, t_i)$, and (70) follows by applying a similar argument as in the case $n = 1$.

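Moving to the proof of the fifth part of the proposition, fix $\mathbf{t} = (t_0, t_1, \ldots, t_n)$ as in (6) with $t_1 = \delta > 0$. Applying the discrete-time conservation law (55), we have

$$I_{\mathbf{t}}(X_0^T \to Y_0^T) + I_{\mathbf{t}}(Y_0^{T-\delta} \to X_0^T) = I(X_0^T; Y_0^T) \tag{72}$$

and consequently, for any $\varepsilon > 0$,

\begin{align}
& \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(X_0^T \to Y_0^T) + \inf_{\{\mathbf{t} : \, \max_i t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(Y_0^{T-\delta} \to X_0^T) \tag{73} \\
&= \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(X_0^T \to Y_0^T) + \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(Y_0^{T-\delta} \to X_0^T) \tag{74} \\
&= \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} \left[ I_{\mathbf{t}}(X_0^T \to Y_0^T) + I_{\mathbf{t}}(Y_0^{T-\delta} \to X_0^T) \right] \tag{75} \\
&= I(X_0^T; Y_0^T), \tag{76}
\end{align}

where the equality in (74) follows since, due to its definition in (42), $I_{\mathbf{t}}(Y_0^{T-\delta} \to X_0^T)$ does not decrease by refining the time partition $\mathbf{t}$ in the $[0, \delta)$ interval; the equality in (75) follows from the refinement property in Proposition 1, which implies that for arbitrary processes $X_0^T, Y_0^T, Z_0^T, W_0^T$ and partitions $\mathbf{t}$ and $\mathbf{t}'$ there exists a third partition $\mathbf{t}''$ (which will be a refinement of both) such that

$$I_{\mathbf{t}''}(X_0^T \to Y_0^T) + I_{\mathbf{t}''}(Z_0^T \to W_0^T) \le I_{\mathbf{t}}(X_0^T \to Y_0^T) + I_{\mathbf{t}'}(Z_0^T \to W_0^T);$$

and the equality in (76) follows since (72) holds for any $\mathbf{t} = (t_0, t_1, \ldots, t_n)$ with $t_1 = \delta$. Hence,

\begin{align}
& I(X_0^T; Y_0^T) \tag{77} \\
&= \lim_{\varepsilon \to 0^+} \left[ \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(X_0^T \to Y_0^T) + \inf_{\{\mathbf{t} : \, \max_i t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(Y_0^{T-\delta} \to X_0^T) \right] \tag{78} \\
&= \lim_{\varepsilon \to 0^+} \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(X_0^T \to Y_0^T) + \lim_{\varepsilon \to 0^+} \inf_{\{\mathbf{t} : \, \max_i t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(Y_0^{T-\delta} \to X_0^T) \tag{79} \\
&= \lim_{\varepsilon \to 0^+} \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} \left[ I(X_0^{\delta}; Y_0^{\delta}) + \sum_{i=2}^{n} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) \right] + I(Y_0^{T-\delta} \to X_0^T) \tag{80} \\
&= I(X_0^{\delta}; Y_0^{\delta}) + \lim_{\varepsilon \to 0^+} \inf_{\{\mathbf{t} : \, t_1 = \delta, \, \max_{i \ge 2} t_i - t_{i-1} \le \varepsilon\}} \sum_{i=2}^{n} I(Y_{t_{i-1}}^{t_i}; X_0^{t_i} | Y_0^{t_{i-1}}) + I(Y_0^{T-\delta} \to X_0^T) \tag{81} \\
&= I(X_0^{\delta}; Y_0^{\delta}) + I(X_{\delta}^T \to Y_{\delta}^T | Y_0^{\delta}) + I(Y_0^{T-\delta} \to X_0^T), \tag{82}
\end{align}

where the equality in (78) follows by taking the limit $\varepsilon \to 0$ from both sides of (76); the equality in (80) follows by writing out $I_{\mathbf{t}}(X_0^T \to Y_0^T)$ explicitly for $\mathbf{t}$ with $t_1 = \delta$ and using (18) to equate the second limit in (79) with $I(Y_0^{T-\delta} \to X_0^T)$; and the equality in (82) follows by applying (18) on the conditional distribution of the pair $(X_{\delta}^T, Y_{\delta}^T)$ given $Y_0^{\delta}$. We have thus proven (50), or equivalently, the identity

$$I(X_0^{\delta}; Y_0^{\delta}) + I(X_{\delta}^T \to Y_{\delta}^T | Y_0^{\delta}) = I(X_0^T; Y_0^T) - I(Y_0^{T-\delta} \to X_0^T). \tag{83}$$

Finally, the identities in (51) and (52) follow by considering the limit supremum and the limit infimum, respectively, of both sides of (83). The identity in (54) is an immediate consequence of (51) and (52). ■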

IV. Directed Information, Feedback, and Causal Estimation 
A. The Gaussian Channel 

In [18], Duncan discovered the following fundamental relationship between the minimum mean squared error (MMSE) in causal estimation of a target signal corrupted by an additive white Gaussian noise (AWGN) in continuous time and the mutual information between the clean and noise-corrupted signals:

Theorem 1 (Duncan [18]). Let $X_0^T$ be a signal of finite average power $\int_0^T E[X_t^2] \, dt < \infty$, independent of a standard Brownian motion $\{B_t\}$. Let $Y_0^T$ satisfy $dY_t = X_t \, dt + dB_t$. Then

$$\frac{1}{2} \int_0^T E\left[ (X_t - E[X_t | Y_0^t])^2 \right] dt = I(X_0^T; Y_0^T). \tag{84}$$

A remarkable aspect of Duncan's theorem is that the relationship (84) holds regardless of the distribution of $X_0^T$. Among its ramifications is the invariance of the causal MMSE to the flow of time, or more generally, to any reordering of time [20], [21].

A key stipulation in Duncan's theorem is the independence between the noise-free signal $X_0^T$ and the channel noise $\{B_t\}$, which excludes scenarios in which the evolution of $X_t$ is affected by the channel noise, as is often the case in signal processing (e.g., target tracking) and communication (e.g., in the presence of feedback). Indeed, the identity (84) does not hold in the absence of such a stipulation.

As an extreme example, consider the case where the channel input is simply the channel output with some delay, i.e.,

$$X_{t+\varepsilon} = Y_t \tag{85}$$

for some $\varepsilon > 0$ (and $X_t = 0$ for $t \in [0, \varepsilon)$). In this case the causal MMSE on the left side of (84) is clearly 0, while the mutual information on its right side is infinite. On the other hand, in this case the directed information $I(X_0^T \to Y_0^T) = 0$, as can be seen by noting that $I_{\mathbf{t}}(X_0^T \to Y_0^T) = 0$ for all $\mathbf{t}$ satisfying $\max_i (t_i - t_{i-1}) \le \varepsilon$ (since for such $\mathbf{t}$, $X_0^{t_i}$ is determined by $Y_0^{t_{i-1}}$ for all $i$).
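A discrete-time analogue of this delay example makes the contrast concrete. The sketch below is an illustration under assumptions made here, not the continuous-time setting of the text: the input is the previous output ($X_1 = 0$, $X_{i+1} = Y_i$), the output is $Y_i = X_i \oplus Z_i$ with i.i.d. fair bits $Z_i$, and $n = 3$. In discrete time the mutual information is finite rather than infinite, but the directed information still vanishes because $X^i$ is a function of $Y^{i-1}$.

```python
from collections import defaultdict
from itertools import product
from math import log2


def cmi(pmf, fa, fb, fc):
    """I(A;B|C) in bits for A=fa(w), B=fb(w), C=fc(w), with w drawn from pmf."""
    pabc, pac, pbc, pc = (defaultdict(float) for _ in range(4))
    for w, p in pmf.items():
        a, b, c = fa(w), fb(w), fc(w)
        pabc[a, b, c] += p
        pac[a, c] += p
        pbc[b, c] += p
        pc[c] += p
    return sum(p * log2(p * pc[c] / (pac[a, c] * pbc[b, c]))
               for (a, b, c), p in pabc.items() if p > 0)


n = 3
pmf = {}
for z in product((0, 1), repeat=n):       # i.i.d. fair noise bits
    x, y, prev_y = [], [], 0              # X_1 = 0 by convention
    for zi in z:
        xi = prev_y                       # input = previous output (unit delay)
        yi = xi ^ zi
        x.append(xi)
        y.append(yi)
        prev_y = yi
    pmf[tuple(x), tuple(y)] = 0.5 ** n

directed = sum(cmi(pmf, lambda w, i=i: w[0][:i], lambda w, i=i: w[1][i - 1],
                   lambda w, i=i: w[1][:i - 1]) for i in range(1, n + 1))
mutual = cmi(pmf, lambda w: w[0], lambda w: w[1], lambda w: ())
print(directed, mutual)
```

Here the mutual information equals the entropy of $(Y_1, Y_2)$, namely 2 bits, while the directed information is 0.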

The third remark following Proposition 3 implies that Theorem 1 could be equivalently stated with $I(X_0^T; Y_0^T)$ on the right side of (84) replaced by $I(X_0^T \to Y_0^T)$. Furthermore, such a modified identity would be valid in the extreme example in (85). This is no coincidence and is a consequence of the result that follows, which generalizes Duncan's theorem. To state it formally we assume a probability space $(\Omega, \mathcal{F}, P)$ with an associated filtration $\{\mathcal{F}_t\}$ satisfying the "usual conditions" (right-continuous and $\mathcal{F}_0$ contains all the $P$-negligible events in $\mathcal{F}$, cf., e.g., [22, Definition 2.25]). Recall also that when the standard Brownian motion is adapted to $\{\mathcal{F}_t\}$ then, by definition, it is implied that, for any $s < t$, $B_t - B_s$ is independent of $\mathcal{F}_s$ (rather than merely of $B_0^s$, cf., e.g., [22, Definition 1.1]).



Theorem 2. Let $\{(X_t, B_t)\}_{t=0}^{T}$ be adapted to the filtration $\{\mathcal{F}_t\}_{t=0}^{T}$, where $X_0^T$ is a signal of finite average power $\int_0^T E[X_t^2] \, dt < \infty$ and $B_0^T$ is a standard Brownian motion. Let $Y_0^T$ be the output of the AWGN channel whose input is $X_0^T$ and whose noise is driven by $B_0^T$, i.e.,

$$dY_t = X_t \, dt + dB_t. \tag{86}$$

Suppose that the regularity assumptions of Proposition 2 are satisfied for all $0 < t < T$. Then

$$\frac{1}{2} \int_0^T E\left[ (X_t - E[X_t | Y_0^t])^2 \right] dt = I(X_0^T \to Y_0^T). \tag{87}$$

Note that unlike in Theorem 1, where the channel input process is independent of the channel noise process, in Theorem 2 no such stipulation exists and thus the setting in the latter accommodates the presence of feedback. Furthermore, since $I(X_0^T \to Y_0^T)$ is not invariant to the direction of the flow of time in general, Theorem 2 implies, as should be expected, that neither is the causal MMSE for processes evolving in the generality afforded by the theorem.

That Theorem 1 can be extended to accommodate the presence of feedback has been established for a communication theoretic framework by Kadota, Zakai, and Ziv [23]. Indeed, in communication over the AWGN channel where $X_0^T = X_0^T(M)$ is the waveform associated with message $M$, in the absence of feedback the Markov relation $M \to X_0^T \to Y_0^T$ implies that $I(X_0^T; Y_0^T)$ on the right hand side of (84), when applying Theorem 1 in this restricted communication framework, can be equivalently written as $I(M; Y_0^T)$. The main result of [23] is that this relationship between the causal estimation error and $I(M; Y_0^T)$ persists in the presence of feedback. Thus, the combination of Theorem 2 with the main result of [23] implies that in communication over the AWGN channel, with or without feedback, we have $I(M; Y_0^T) = I(X_0^T \to Y_0^T)$. This equality holds well beyond the Gaussian channel, as is elaborated in Section VI. Note further that Theorem 2 holds in settings more general than communication, where there is no message but merely a signal observed through additive white Gaussian noise, adapted to a general filtration.

Theorem [2] is a direct consequence of Proposition [2] and the following lemma. 

Lemma 1 ( B241 X Let P and Q be two probability laws governing (Xq , Y Q T ), under which (186b and the stipulations 
of Theorem [2] are satisfied. Then 



D{P Y(T \\Q y t)= 1 -E p 



[ T (X t - Eq^Yq 1 ]) 2 - {X t - Ep[X t \Y$\)*dt 
Jo 



(88) 



Lemma 1 was implicit in [24]. It follows from the second part of [24, Theorem 2], put together with the exposition
in [24, Subsection IV-D] (cf., in particular, equations (148) through (161) therein).

Proof of Theorem 2: Consider

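\begin{align}
I(Y_t^{t+\delta}; X_0^{t+\delta} | Y_0^t) &= D\left( P_{Y_t^{t+\delta}|X_0^{t+\delta}, Y_0^t} \,\big\|\, P_{Y_t^{t+\delta}|Y_0^t} \,\big|\, P_{Y_0^t, X_0^{t+\delta}} \right) \tag{89} \\
&= \int D\left( P_{Y_t^{t+\delta}|X_0^{t+\delta}=x_0^{t+\delta}, Y_0^t=y_0^t} \,\big\|\, P_{Y_t^{t+\delta}|Y_0^t=y_0^t} \right) dP_{Y_0^t, X_0^{t+\delta}}(y_0^t, x_0^{t+\delta}) \tag{90} \\
&= \int \frac{1}{2} E\left[ \int_t^{t+\delta} (x_s - E[X_s|Y_0^s])^2 - (x_s - x_s)^2 \, ds \right] dP_{Y_0^t, X_0^{t+\delta}}(y_0^t, x_0^{t+\delta}) \tag{91} \\
&= \frac{1}{2} \int_t^{t+\delta} E[(X_s - E[X_s|Y_0^s])^2] \, ds, \tag{92}
\end{align}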



where the equality in (91) follows by applying (88) to the integrand in (90) as follows: replacing the time interval
$[0,T)$ by $[t, t+\delta)$, substituting $P$ by the law of $(X_t^{t+\delta}, Y_t^{t+\delta})$ conditioned on $(y_0^t, x_0^{t+\delta})$ (note that $X_t^{t+\delta}$ is
deterministic at $x_t^{t+\delta}$ under this law), and substituting $Q$ by the law of $(X_t^{t+\delta}, Y_t^{t+\delta})$ conditioned on $y_0^t$. It follows
that $i_t$ defined in (22) exists and is given by

$$ i_t = \frac{1}{2} E[(X_t - E[X_t|Y_0^t])^2], \tag{93} $$



which completes the proof by an appeal to Proposition 2.

B. The Poisson Channel

Consider the function $\ell : [0,\infty) \times [0,\infty) \to [0,\infty]$ given by

$$ \ell(x, \hat{x}) = x \log(x/\hat{x}) - x + \hat{x}. \tag{94} $$

That this function is natural for quantifying the loss when estimating non-negative quantities is implied in
[25, Section 2], where some of its basic properties are exposed. Among them is that conditional expectation is the
optimal estimator not only under the squared error loss but also under $\ell$, i.e., for any nonnegative random variable
$X$ jointly distributed with $Y$,

$$ \min_{\hat{X}(\cdot)} E[\ell(X, \hat{X}(Y))] = E[\ell(X, E[X|Y])], \tag{95} $$

where the minimum is over all (measurable) maps from the domain of $Y$ into $[0,\infty)$. With this loss function, the
analogue of Duncan's theorem for the case of Poisson noise can be stated as:

Theorem 3 ([25], [26]). Let $Y_0^T$ be a doubly stochastic Poisson process and $X_0^T$ be its intensity process (i.e., conditioned
on $X_0^T$, $Y_0^T$ is a non-homogeneous Poisson process with rate function $X_0^T$) satisfying $E \int_0^T |X_t \log X_t| \, dt < \infty$.
Then

$$ \int_0^T E[\ell(X_t, E[X_t|Y_0^t])] \, dt = I(X_0^T; Y_0^T). \tag{96} $$

We remark that for $\phi(\alpha) = \alpha \log \alpha$, one has

$$ E[\phi(X_t) - \phi(E[X_t|Y_0^t])] = E[\ell(X_t, E[X_t|Y_0^t])], \tag{97} $$

and thus (96) can equivalently be expressed as

$$ \int_0^T E[\phi(X_t) - \phi(E[X_t|Y_0^t])] \, dt = I(X_0^T; Y_0^T), \tag{98} $$

as was done in [26] and other classical references. But it was not until [25] that the left hand side was established as
the minimum mean causal estimation error under an explicitly identified loss function, thus completing the analogy
with Duncan's theorem.
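The two properties of the loss function used above can be checked numerically in the simplest setting. The following Python sketch is an illustration only, under the assumption that $Y$ carries no information, so that the conditional expectation reduces to the mean $E[X]$. It checks that the mean minimizes the expected loss over a grid of constant estimators, and that $E[\phi(X)] - \phi(E[X]) = E[\ell(X, E[X])]$. The support and pmf below are hypothetical.

```python
import math

def loss(x, xhat):
    """l(x, xhat) = x*log(x/xhat) - x + xhat, with the convention 0*log(0) = 0."""
    if x == 0.0:
        return xhat
    return x * math.log(x / xhat) - x + xhat

def phi(a):
    """phi(a) = a*log(a), with phi(0) = 0."""
    return a * math.log(a) if a > 0.0 else 0.0

def expected_loss(values, probs, xhat):
    """E[l(X, xhat)] for a constant estimator xhat > 0."""
    return sum(p * loss(x, xhat) for x, p in zip(values, probs))

values = [0.5, 1.0, 4.0]   # hypothetical support of X
probs = [0.2, 0.5, 0.3]    # hypothetical pmf of X
mean = sum(p * x for x, p in zip(values, probs))

# Grid search over constant estimators in (0, 10].
grid = [0.01 * k for k in range(1, 1001)]
best = min(grid, key=lambda c: expected_loss(values, probs, c))

# Unconditional version of the identity between phi and the loss.
lhs = sum(p * phi(x) for x, p in zip(values, probs)) - phi(mean)
rhs = expected_loss(values, probs, mean)

if __name__ == "__main__":
    print(best, mean, lhs, rhs)
```

The grid minimizer should land on the grid point closest to the mean, and `lhs` should match `rhs` up to floating-point error.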

The condition stipulated in the third item of Proposition 3 is readily seen to hold when $Y_0^T$ is a doubly stochastic
Poisson process and $X_0^T$ is its intensity process. Thus, the above theorem could equivalently be stated with directed
information rather than mutual information on the right hand side of (96). Indeed, with continuous-time directed
information replacing mutual information, this relationship remains true in much wider generality, as the next
theorem shows. In the statement of the theorem, we use the notions of a point process and its predictable intensity,
as developed in detail in, e.g., [27, Chapter II].

Theorem 4. Let $Y_t$ be a point process and $X_t$ be its $\mathcal{F}_t^Y$-predictable intensity, where $\mathcal{F}_t^Y$ is the $\sigma$-field $\sigma(Y_0^t)$
generated by $Y_0^t$. Suppose that $E \int_0^T |X_t \log X_t| \, dt < \infty$, and that the assumptions of Proposition 2 are satisfied
for all $0 \le t < T$. Then

$$ \int_0^T E[\ell(X_t, E[X_t|Y_0^t])] \, dt = I(X_0^T \to Y_0^T). \tag{99} $$

Paralleling the proof of Theorem 2, the proof of Theorem 4 is a direct application of Proposition 2 and the
following:

Lemma 2 ([25]). Let $P$ and $Q$ be two probability laws governing $(X_0^T, Y_0^T)$ under the setting and stipulations
of Theorem 4. Then

$$ D(P_{Y_0^T} \| Q_{Y_0^T}) = E_P\left[ \int_0^T \ell(X_t, E_Q[X_t|Y_0^t]) - \ell(X_t, E_P[X_t|Y_0^t]) \, dt \right]. \tag{100} $$

Lemma 2 is implicit in [25], following directly from [25, Theorem 4.4] and the discussion in [25, Subsection
7.5]. Equipped with it, the proof of Theorem 4 follows similarly as that of Theorem 2, the role of (88) being played
here by (100).

V. Example: Poisson Channel with Feedback

Let $\mathbf{X} = \{X_t\}$ and $\mathbf{Y} = \{Y_t\}$ be the input and output processes of the continuous-time Poisson channel with
feedback, where each time an event occurs at the channel output, the channel input changes to a new value, drawn
according to the distribution of a positive random variable $X$, independently of the channel input and output up
to that point in time. The channel input remains fixed at that value until the occurrence of the next event at the
channel output, and so on. Throughout this section, the shorthand "Poisson channel with feedback" will refer to
this scenario, with its implied channel input process. The Poisson channel we use here is similar to the well-known
Poisson channel model (e.g., [28]-[35]) with the one difference that the intensity of the Poisson channel changes
according to the input $X$ only when there is an event at the output of the channel. Note that the channel description
given here uniquely determines the joint distribution of the processes.

In the first part of this section, we derive, using Theorem 4, a formula for the directed information rate of this
Poisson channel with feedback. In the second part, we demonstrate the use of this formula by computing and
plotting the directed information rate for a special case in which the intensity alphabet is of size 2.



A. Characterization of the Directed Information Rate 

Proposition 4. The directed information rate between the input and output processes of the Poisson channel with
feedback is

$$ I(\mathbf{X} \to \mathbf{Y}) = \frac{I(X;Y)}{E[1/X]}, \tag{101} $$

where, in $I(X;Y)$ on the right hand side, $Y | \{X = x\} \sim \mathrm{Exp}(x)$, i.e., the conditional density of $Y$ given $\{X = x\}$
is $f(y|x) = x e^{-yx}$.

The key component in the proof of the proposition is the use of Theorem 4 for directed information in
continuous time as a causal mean estimation error. For simplicity of notation, we assume in the derivation of
(101) that $X$ is discrete with probability mass function (pmf) $p_X(x)$, the extension to general distributions being
obvious. An intuition for the expression in (101) can be obtained by considering rate per unit cost [36], i.e.,
$R = I(X;Y) / E[b(X)]$, where $b(x)$ is the cost of the input. In our case, the "cost" of $X$ is proportional to the
average duration of time until the channel can be used again, i.e., $b(x) = 1/x$.

To prove Proposition 4, let us first collect the following observations:

Lemma 3. Let $X \sim p_X(x)$ and $Y | \{X = x\} \sim \mathrm{Exp}(x)$. Define

$$ g(t) := E[X | Y > t] = \frac{\sum_x x e^{-tx} p_X(x)}{\sum_x e^{-tx} p_X(x)}, \quad t \ge 0. \tag{102} $$

Then the following statements hold.

1) The marginal distribution of $X_t$ is

$$ P\{X_t = x\} = \frac{(1/x) p_X(x)}{E[1/X]}, \tag{103} $$

and consequently

$$ E[X_t \log X_t] = \frac{E[\log X]}{E[1/X]}. \tag{104} $$

2) Let $l = l(Y_{-\infty}^0)$ denote the time of occurrence of the last (most recent) event at the channel output prior to
time 0, and define $\tau := -l$. The density of $\tau$ is

$$ f_\tau(t) = \frac{\sum_x p_X(x) e^{-tx}}{E[1/X]}, \quad t \ge 0. \tag{105} $$

3) For $\tau$ distributed as in (105),

$$ E[g(\tau) \log g(\tau)] = \frac{1 - h(Y)}{E[1/X]}. \tag{106} $$
Proof: For the first part of the lemma, note that $X_t$ is an ergodic continuous-time Markov chain and thus
$P\{X_t = x\}$ is equal to the fraction of time that $X_t$ spends in state $x$, which is proportional to $(1/x) p_X(x)$,
accounting for (103), which, in turn, yields

$$ E[X_t \log X_t] = \sum_x \frac{(1/x) p_X(x)}{E[1/X]} \, x \log x = \frac{\sum_x p_X(x) \log x}{E[1/X]} = \frac{E[\log X]}{E[1/X]}, \tag{107} $$

accounting for (104).

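To prove the second part of the lemma, observe that

(a) the interarrival times of the process $\mathbf{Y}$ are i.i.d. $\sim Y$;

(b) $Y$ has a density

$$ f_Y(y) = \sum_x p_X(x) x e^{-xy}, \quad y \ge 0; \tag{108} $$

(c) the probability density of the length of the interarrival interval of the $\mathbf{Y}$ process around 0 is proportional to
$f_Y(y) \cdot y$; and

(d) given that the length of the interarrival interval around 0 is $y$, its left point is uniformly distributed on $[-y, 0]$.

Letting $\mathrm{Unif}[0,y](\cdot)$ denote the density of a random variable uniformly distributed on $[0,y]$, it follows that the
density of $\tau$ is

\begin{align}
f_\tau(t) &= \int_0^\infty \frac{f_Y(y) \cdot y}{\int_0^\infty f_Y(y') \cdot y' \, dy'} \, \mathrm{Unif}[0,y](t) \, dy \tag{109} \\
&= \int_t^\infty \frac{f_Y(y) \cdot y}{\int_0^\infty f_Y(y') \cdot y' \, dy'} \, \frac{1}{y} \, dy \tag{110} \\
&= \frac{\int_t^\infty \sum_x p_X(x) x e^{-xy} \, dy}{\int_0^\infty \sum_x p_X(x) x e^{-xy'} y' \, dy'} \tag{111} \\
&= \frac{\sum_x p_X(x) e^{-tx}}{\sum_x p_X(x) (1/x)} \tag{112} \\
&= \frac{\sum_x p_X(x) e^{-tx}}{E[1/X]}, \tag{113}
\end{align}

where (109) follows by combining observations (c) and (d), and (111) follows by substituting from (108). We have
thus proven the second part of the lemma.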

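To establish the third part, let $F_Y(t)$ denote the cumulative distribution function of $Y$ and consider

\begin{align}
E[g(\tau) \log g(\tau)] &= \int_0^\infty f_\tau(t) g(t) \log g(t) \, dt \tag{114} \\
&= \int_0^\infty \frac{\sum_x p_X(x) e^{-tx}}{E[1/X]} \, g(t) \log g(t) \, dt \tag{115} \\
&= \frac{1}{E[1/X]} \int_0^\infty \Big[ \sum_x x e^{-tx} p_X(x) \Big] \log \frac{\sum_x x e^{-tx} p_X(x)}{\sum_x e^{-tx} p_X(x)} \, dt \tag{116} \\
&= \frac{1}{E[1/X]} \int_0^\infty f_Y(t) \log \frac{f_Y(t)}{1 - F_Y(t)} \, dt \tag{117} \\
&= \frac{1}{E[1/X]} \left( \int_0^\infty f_Y(t) \log \frac{1}{1 - F_Y(t)} \, dt - h(Y) \right) \tag{118} \\
&= \frac{1}{E[1/X]} \left( \int_0^1 \log \frac{1}{1-u} \, du - h(Y) \right) \tag{119} \\
&= \frac{1}{E[1/X]} \big( 1 - h(Y) \big), \tag{120}
\end{align}

where (115) follows by substituting from the second part of the lemma and (117) follows by substituting from
(108) and noting that

$$ \sum_x p_X(x) e^{-tx} = \sum_x p_X(x) \int_t^\infty x e^{-xy} \, dy = \int_t^\infty \sum_x p_X(x) x e^{-xy} \, dy = \int_t^\infty f_Y(y) \, dy = 1 - F_Y(t). \tag{121} $$

We have thus established the third and last part of the lemma. ■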
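The first part of the lemma lends itself to a quick Monte Carlo check. The following Python sketch simulates the input process of the Poisson channel with feedback as described in this section (a new input value is drawn at every output event and held for an $\mathrm{Exp}(x)$ duration) and compares the empirical fraction of time spent at each value with $(1/x)p_X(x)/E[1/X]$. The function names, the two-point input distribution, and the number of renewals are assumptions made for illustration.

```python
import random

def simulate_time_fractions(values, probs, renewals=200_000, seed=0):
    """Empirical fraction of time the input spends at each value."""
    rng = random.Random(seed)
    time_in_state = {x: 0.0 for x in values}
    for _ in range(renewals):
        x = rng.choices(values, weights=probs)[0]  # new input drawn at an output event
        time_in_state[x] += rng.expovariate(x)     # holding time ~ Exp(x), mean 1/x
    total = sum(time_in_state.values())
    return {x: t / total for x, t in time_in_state.items()}

def predicted_fractions(values, probs):
    """Stationary marginal (1/x) p_X(x) / E[1/X]."""
    e_inv_x = sum(p / x for x, p in zip(values, probs))
    return {x: (p / x) / e_inv_x for x, p in zip(values, probs)}

if __name__ == "__main__":
    values, probs = [1.0, 2.0], [0.4, 0.6]  # hypothetical two-point input pmf
    print(simulate_time_fractions(values, probs))
    print(predicted_fractions(values, probs))
```

With these assumed parameters the predicted fractions are $0.4/0.7$ and $0.3/0.7$, and the simulated values should be close to them.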

Proof of Proposition 4: We have

\begin{align}
I(\mathbf{X} \to \mathbf{Y}) &= \lim_{T \to \infty} \frac{1}{T} I(X_0^T \to Y_0^T) \tag{122} \\
&= \lim_{T \to \infty} \frac{1}{T} \int_0^T E\big[ X_t \log X_t - E[X_t|Y_0^t] \log E[X_t|Y_0^t] \big] \, dt \tag{123} \\
&= E\big[ X_0 \log X_0 - E[X_0|Y_{-\infty}^0] \log E[X_0|Y_{-\infty}^0] \big] \tag{124} \\
&= \frac{E[\log X]}{E[1/X]} - E\big[ E[X_0|Y_{-\infty}^0] \log E[X_0|Y_{-\infty}^0] \big], \tag{125}
\end{align}

where (123) follows from the relation between directed information and causal estimation in (99); (124) follows
from the stationarity and martingale convergence; and (125) follows from the first part of Lemma 3. Now, recalling
the definition of the function $g$ in (102), we note that

$$ E[X_0 | l(Y_{-\infty}^0)] = g(-l(Y_{-\infty}^0)). \tag{126} $$

Thus

\begin{align}
E\big[ E[X_0|Y_{-\infty}^0] \log E[X_0|Y_{-\infty}^0] \big] &= E\big[ E[X_0|l(Y_{-\infty}^0)] \log E[X_0|l(Y_{-\infty}^0)] \big] \tag{127} \\
&= E\big[ g(-l(Y_{-\infty}^0)) \log g(-l(Y_{-\infty}^0)) \big] \tag{128} \\
&= E[g(\tau) \log g(\tau)] \tag{129} \\
&= \frac{1 - h(Y)}{E[1/X]}, \tag{130}
\end{align}

where (127) follows from the Markov relation $Y_{-\infty}^0 \to l(Y_{-\infty}^0) \to X_0$, (128) follows from (126), and (130) from
the last part of Lemma 3. Thus

\begin{align}
I(\mathbf{X} \to \mathbf{Y}) &= \frac{E[\log X] - 1 + h(Y)}{E[1/X]} \tag{131} \\
&= \frac{h(Y) - h(Y|X)}{E[1/X]} \tag{132} \\
&= \frac{I(X;Y)}{E[1/X]}, \tag{133}
\end{align}

where (131) follows by combining (125) with (130), and (132) follows by noting that

$$ h(Y|X) = \sum_x h(Y|X = x) p_X(x) = \sum_x (1 - \log x) p_X(x) = 1 - E[\log X]. \tag{134} $$

This completes the proof of Proposition 4. ■

B. Evaluation of the Directed Information Rate 

Fig. 1 depicts the directed information rate $I(\mathbf{X} \to \mathbf{Y})$ for the case where $X$ takes only two values $\lambda_1$ and $\lambda_2$.
We have used numerical evaluation of $I(X;Y)$ in the right hand side of (101) to compute the directed information
rate. The figure shows the influence of $p = P\{X = \lambda_1\}$ on the directed information rate where $\lambda_1 = 1$ and $\lambda_2 = 2$.
As expected, the maximum is achieved when there is higher probability that the encoder output will be the higher
rate $\lambda_2$, which would imply more channel uses per unit time, but not much higher as otherwise the input value will
be close to deterministic.
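The numerical evaluation described above can be reproduced along the following lines. This is a minimal Python sketch, not the authors' code: it assumes natural logarithms (rates in nats per unit time), uses $h(Y|X) = 1 - E[\log X]$ for exponential outputs, and approximates $h(Y)$ by Simpson's rule on a truncated interval, which is adequate only when the smallest intensity is not much below 1. All function names and the integration parameters are hypothetical.

```python
import math

def entropy_of_output(values, probs, upper=60.0, steps=12_000):
    """Differential entropy (nats) of Y with density sum_x p(x) x exp(-x y), by Simpson's rule."""
    def integrand(y):
        f = sum(p * x * math.exp(-x * y) for x, p in zip(values, probs))
        return -f * math.log(f) if f > 0.0 else 0.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0

def directed_information_rate(values, probs):
    """I(X;Y) / E[1/X] with Y | X = x ~ Exp(x)."""
    support = [(x, p) for x, p in zip(values, probs) if p > 0.0]
    xs = [x for x, _ in support]
    ps = [p for _, p in support]
    e_log_x = sum(p * math.log(x) for x, p in support)
    e_inv_x = sum(p / x for x, p in support)
    mutual_info = entropy_of_output(xs, ps) - (1.0 - e_log_x)  # h(Y) - h(Y|X)
    return mutual_info / e_inv_x

if __name__ == "__main__":
    for k in range(11):
        p = k / 10.0  # p = P{X = 1}, with the other input value equal to 2
        print(p, directed_information_rate([1.0, 2.0], [p, 1.0 - p]))
```

Sweeping `p` over a finer grid and taking the maximum gives a rough estimate of the binary-input capacity discussed later in this section.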



[Figure 1: plot of the directed information rate versus $p := P\{X = \lambda_1\}$.]

Fig. 1. The directed information rate between the input and output processes for the continuous-time Poisson channel with feedback, as a
function of $p(x)$, the pmf of the input to the channel. The input to the channel is one of two possible values $\lambda_1 = 1$ and $\lambda_2 = 2$, and it is
the intensity of the Poisson process at the output of the channel until the next event.

Fig. 2 depicts the maximal value (optimized w.r.t. $P\{X = \lambda_1\}$) of the directed information rate when $\lambda_1$ is
fixed and is equal to 1 and $\lambda_2$ varies. This value is the capacity of the Poisson channel with feedback, when the
inputs are restricted to one of the two values $\lambda_1$ or $\lambda_2$. When $\lambda_2 = 0$ the capacity is obviously zero since any
use of $X = \lambda_2$ as input will cause the channel not to change any further. It is also obviously zero at $\lambda_2 = 1$
since in this case $\lambda_1 = \lambda_2$, so there is only one possible input to the channel. As $\lambda_2$ increases, the capacity of the
channel increases unboundedly since, for $\lambda_2 \gg \lambda_1$, the channel effectively operates as a noise-free binary channel,
where one symbol "costs" an average duration of 1 while the other a vanishing average duration. Thus the limiting
capacity with increasing $\lambda_2$ is equal to $\lim_{p \to 0} H(p)/p = \infty$.

Fig. 2. Capacity of the Poisson channel with feedback, in the case where the channel input is constrained to the binary set $\{\lambda_1, \lambda_2\}$, when $\lambda_1$ is
fixed and is equal to 1 and $\lambda_2$ varies.

One can consider a discrete-time memoryless channel, where the input $X$ is discrete ($\lambda_1$ or $\lambda_2$) and the output $Y$
is distributed according to $\mathrm{Exp}(X)$. Consider now a random cost $b(X) = Y$, where $Y$ is the output of the channel.
Using the result from [36] we obtain that the capacity per unit cost of the discrete memoryless channel is

$$ \max_{p(x)} \frac{I(X;Y)}{E[Y]} = \max_{p(x)} \frac{I(X;Y)}{E[1/X]}, \tag{135} $$

where the equality follows since $E[Y] = E[E[Y|X]] = E[1/X]$. Finally, we note that the capacity of the Poisson
channel in the example above is the capacity per unit cost of the discrete memoryless channel. Thus, by Proposition 4,
we can conclude that the continuous-time directed information rate characterizes the capacity of the Poisson channel
with feedback. In the next section we will see that the continuous-time directed information rate characterizes the
capacity of a large family of continuous-time channels.



VI. Communication over Continuous-Time Channels with Feedback 

We first review the definition of a block-ergodic process as given by Berger [37]. Let $(X, \mathcal{X}, \mu)$ denote a
continuous-time process $\{X_t\}_{t \ge 0}$ drawn from a space $\mathcal{X}$ according to the probability measure $\mu$. For $t > 0$, let $T^t$ be
a $t$-shift transformation, i.e., $(T^t x)_s = x_{s+t}$. A measurable set $A$ is $t$-invariant if it does not change under the $t$-shift
transformation, i.e., $T^t A = A$. A continuous-time process $(X, \mathcal{X}, \mu)$ is $\tau$-ergodic if every measurable $\tau$-invariant
set of processes has either probability 1 or 0, i.e., $\mu(A) = (\mu(A))^2$ for any $\tau$-invariant set $A$. The
definition of $\tau$-ergodicity means that if we take the process $\{X_t\}_{t \ge 0}$ and slice it into time-blocks of length $\tau$, then
the new discrete-time process $(X_0^\tau, X_\tau^{2\tau}, X_{2\tau}^{3\tau}, \ldots)$ is ergodic. A continuous-time process $(X, \mathcal{X}, \mu)$ is block-ergodic
if it is $\tau$-ergodic for every $\tau > 0$. Berger [37] showed that weak mixing (therefore also strong mixing) implies
block ergodicity.

Now let us describe the communication model of our interest (see Fig. 3) and show that the continuous-time
directed information characterizes the capacity. Consider a continuous-time channel that is specified by

• the channel input and output alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively, that are not necessarily finite, and



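[Figure 3: block diagram. The message $M$ enters the encoder, which produces $x_t(m, y_0^{t-\Delta})$; the channel outputs $Y_t = g(X_t, Z_t)$; the decoder produces the message estimate $\hat{M}$; the channel output is fed back to the encoder through a delay $\Delta$.]

Fig. 3. Continuous-time communication with delay $\Delta$ and channel of the form $Y_t = g(X_t, Z_t)$, where $Z_t$ is a block ergodic process.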



• the channel output at time $t$

$$ Y_t = g(X_t, Z_t) \tag{136} $$

corresponding to the channel input $X_t$ at time $t$, where $\{Z_t\}$ is a stationary ergodic noise process on an
alphabet $\mathcal{Z}$ and $g : \mathcal{X} \times \mathcal{Z} \to \mathcal{Y}$ is a given function.

We assume that the conditioned cumulative distribution function (cdf) $F(y_t^{t+\delta} | x_0^{t+\delta}, y_0^t)$ is well-defined for any
$t \ge 0$ and $\delta > 0$. [QUESTION: Is this assumption absolutely necessary or can be proved from a lighter assumption
on $g$ and $Z$?]

A $(2^{TR}, T)$ code with delay $\Delta > 0$ for the channel consists of

• a message set $\{1, 2, \ldots, 2^{\lceil TR \rceil}\}$,
• an encoder that assigns a symbol

$$ x_t(m, y_0^{t-\Delta}) \tag{137} $$

to each message $m \in \{1, 2, \ldots, 2^{\lceil TR \rceil}\}$ and past received output signal $y_0^{t-\Delta} \in \mathcal{Y}^{[0, t-\Delta)}$ for $t \in [0, T)$, and
• a decoder that assigns a message estimate $\hat{m}(y_0^T) \in \{1, 2, \ldots, 2^{\lceil TR \rceil}\}$ to each received output signal $y_0^T \in \mathcal{Y}^{[0,T)}$.

We assume that the message $M$ is uniformly distributed on $\{1, 2, \ldots, \lfloor 2^{TR} \rfloor\}$ and independent of the noise process
$\{Z_t\}$. From (136), we have

$$ F(y_t^{t+\delta} | x_0^{t+\delta}, y_0^t, m) = F(y_t^{t+\delta} | x_0^{t+\delta}, y_0^t), \tag{138} $$

which is analogous to the assumption in the discrete case that $p(y_{n+1} | x^{n+1}, y^n, m) = p(y_{n+1} | x^{n+1}, y^n)$.

From the definition of the encoding function in (137), we note that the conditioned cdf $F(x_t^{t+\delta} | x_0^t, y_0^{t+\delta})$ exists,
and for any $t \ge 0$, $\delta > 0$, and $\Delta > \delta$,

$$ F(x_t^{t+\delta} | x_0^t, y_0^t) = F(x_t^{t+\delta} | x_0^t, y_0^{t+\delta-\Delta}). \tag{139} $$

This is analogous to the assumption in the discrete case that whenever there is feedback of delay $d \ge 1$,
$p(x_{n+1} | x^n, y^n) = p(x_{n+1} | x^n, y^{n+1-d})$.

Similar communication settings with feedback in continuous time were studied by Kadota, Zakai, and Ziv [38]
for continuous-time memoryless channels, where it is shown that feedback does not increase the capacity, and by
Ihara [39], [40] for the Gaussian case. Our main result in this section is showing that the operational capacity,
defined below, can be characterized by the information capacity, which is the maximum of directed information
from the channel input process to the output process. Next we define an achievable rate, the operational feedback
capacity, and the information feedback capacity for our setting.

Definition 2. A rate $R$ is said to be achievable with feedback delay $\Delta$ if for each $T$ there exists a family of
$(2^{RT}, T)$ codes such that

$$ \lim_{T \to \infty} P\{M \ne \hat{M}(Y_0^T)\} = 0. \tag{140} $$

Definition 3. Let

$$ C(\Delta) = \sup\{R : R \text{ is achievable with feedback delay } \Delta\} \tag{141} $$

be the (operational) feedback capacity with delay $\Delta$, and let the (operational) feedback capacity be

$$ C = \sup_{\Delta > 0} C(\Delta). \tag{142} $$

From the monotonicity of $C(\Delta)$ in $\Delta$ we have $\sup_{\Delta > 0} C(\Delta) = \lim_{\Delta \to 0} C(\Delta)$.
Definition 4. Let $C^I(\Delta)$ be the information feedback capacity defined as

$$ C^I(\Delta) = \lim_{T \to \infty} \frac{1}{T} \sup_{\mathcal{S}_\Delta} I(X_0^T \to Y_0^T), \tag{143} $$

where the supremum in (143) is over $\mathcal{S}_\Delta$, which is the set of all channel input processes of the form

$$ X_t = \begin{cases} g_t(U_t, Y_0^{t-\Delta}), & t \ge \Delta, \\ g_t(U_t), & t < \Delta, \end{cases} \tag{144} $$

some family of functions $\{g_t\}_{t=0}^T$, and some process $U_0^T$ which is independent of the channel noise process $Z_0^T$
(appearing in (136)) and has a finite cardinality that may depend on $T$.

The limit in (143) is shown to exist in Lemma 4 using the superadditivity property. We now characterize $C(\Delta)$
in terms of $C^I(\Delta)$ for the class of channels defined in (136).

Theorem 5. For the channel defined in (136),

$$ C(\Delta) \le C^I(\Delta), \tag{145} $$
$$ C(\Delta) \ge C^I(\Delta') \quad \text{for all } \Delta' > \Delta. \tag{146} $$

Since $C^I(\Delta)$ is a decreasing function of $\Delta$, (146) may be written as $C(\Delta) \ge \lim_{\delta \to \Delta^+} C^I(\delta)$, and the limit
exists because of the monotonicity. Since the function is monotonic, $C^I(\Delta) = \lim_{\delta \to \Delta^+} C^I(\delta)$ with the possible
exception of the points $\Delta$ in a set of measure zero [41, p. 5]. Therefore $C(\Delta) = C^I(\Delta)$ for any $\Delta > 0$ except
for a set of points of measure zero. Furthermore, (145) and (146) imply that $\sup_{\Delta > 0} C(\Delta) = \sup_{\Delta > 0} C^I(\Delta)$, hence
we also have $C = \sup_{\Delta > 0} C^I(\Delta) = \lim_{\Delta \to 0} C^I(\Delta)$.
Before proving the theorem we show that the limit in (143) exists.

Lemma 4. The term $\sup_{\mathcal{S}_\Delta} I(X_0^T \to Y_0^T)$ is superadditive, namely,

$$ \sup_{\mathcal{S}_\Delta} I(X_0^{T_1+T_2} \to Y_0^{T_1+T_2}) \ge \sup_{\mathcal{S}_\Delta} I(X_0^{T_1} \to Y_0^{T_1}) + \sup_{\mathcal{S}_\Delta} I(X_0^{T_2} \to Y_0^{T_2}), \tag{147} $$

and therefore the limit in (143) exists and is equal to

$$ \lim_{T \to \infty} \frac{1}{T} \sup_{\mathcal{S}_\Delta} I(X_0^T \to Y_0^T) = \sup_T \frac{1}{T} \sup_{\mathcal{S}_\Delta} I(X_0^T \to Y_0^T). \tag{148} $$

To prove Lemma 4 we use the following result:

Lemma 5. Let $\{(X_i, Y_i)\}_{i=1}^{n+m}$ be a pair of discrete-time processes such that the Markov relation $X_i \to
(X_{n+1}^{i-1}, Y_{n+1}^{i-1}) \to (X^n, Y^n)$ holds for $i \in \{n+1, n+2, \ldots, n+m\}$. Then

$$ I(X^{n+m} \to Y^{n+m}) \ge I(X^n \to Y^n) + I(X_{n+1}^{n+m} \to Y_{n+1}^{n+m}). \tag{149} $$

Proof: The result is a consequence of the identity [4, Eq. (11)]

$$ I(X^n \to Y^n) = \sum_{i=1}^n I(X_i; Y_i^n | X^{i-1}, Y^{i-1}). \tag{150} $$

Consider

\begin{align}
I(X^{n+m} \to Y^{n+m}) &= \sum_{i=1}^{n+m} I(X_i; Y_i^{n+m} | X^{i-1}, Y^{i-1}) \tag{151} \\
&= \sum_{i=1}^{n} I(X_i; Y_i^{n+m} | X^{i-1}, Y^{i-1}) + \sum_{i=n+1}^{n+m} I(X_i; Y_i^{n+m} | X^{i-1}, Y^{i-1}) \tag{152} \\
&\ge \sum_{i=1}^{n} I(X_i; Y_i^{n} | X^{i-1}, Y^{i-1}) + \sum_{i=n+1}^{n+m} I(X_i; Y_i^{n+m} | X_{n+1}^{i-1}, Y_{n+1}^{i-1}) \tag{153} \\
&= I(X^n \to Y^n) + I(X_{n+1}^{n+m} \to Y_{n+1}^{n+m}), \tag{154}
\end{align}

where (151) follows from the identity given in (150), and (153) follows from the Markov chain assumption in the
lemma. ■
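The identity used at the start of this proof can be checked numerically on a small example. The Python sketch below is an illustration only: it draws a random joint pmf of $(X_1, X_2, Y_1, Y_2)$ on binary alphabets and compares Massey's sum $\sum_i I(X^i; Y_i | Y^{i-1})$ with the sum $\sum_i I(X_i; Y_i^n | X^{i-1}, Y^{i-1})$ for $n = 2$. The helper names and the random pmf are assumptions made for the example.

```python
import itertools
import math
import random

def entropy(pmf, keep):
    """Entropy (nats) of the marginal of pmf on the coordinate indices in keep."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marginal.values() if p > 0.0)

def cond_mi(pmf, a, b, c=()):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), for index tuples a, b, c."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

# Each outcome is a tuple (x1, x2, y1, y2); these are its coordinate indices.
X1, X2, Y1, Y2 = 0, 1, 2, 3

rng = random.Random(1)
weights = [rng.random() for _ in range(16)]
total = sum(weights)
pmf = {o: w / total for o, w in zip(itertools.product((0, 1), repeat=4), weights)}

# Massey's definition for n = 2: I(X1; Y1) + I(X1, X2; Y2 | Y1).
massey = cond_mi(pmf, (X1,), (Y1,)) + cond_mi(pmf, (X1, X2), (Y2,), (Y1,))
# Alternative sum for n = 2: I(X1; Y1, Y2) + I(X2; Y2 | X1, Y1).
alternative = cond_mi(pmf, (X1,), (Y1, Y2)) + cond_mi(pmf, (X2,), (Y2,), (X1, Y1))

if __name__ == "__main__":
    print(massey, alternative)
```

The two printed values should agree up to floating-point error for any joint pmf, since the identity does not rely on the Markov assumption of the lemma.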
Proof of Lemma 4: First note that we do not increase the term $\inf_{\mathbf{t}} I_{\mathbf{t}}(X_0^{T_1+T_2} \to Y_0^{T_1+T_2})$ by restricting
the time-partition $\mathbf{t}$ to have an interval starting at point $T_1$. Now fix three time-partitions: $\mathbf{t}_1$ in $[0, T_1)$, $\mathbf{t}_2$ in
$[T_1, T_1+T_2)$, and $\mathbf{t}$ in $[0, T_1+T_2)$ such that $\mathbf{t}$ is a concatenation of $\mathbf{t}_1$ and $\mathbf{t}_2$. For $X_0^{T_1}$ and $X_{T_1}^{T_1+T_2}$, fix the input
functions of the form of (144) and fix the arguments $U_0^{T_1}$ and $U_{T_1}^{T_1+T_2}$ which correspond to $X_0^{T_1}$ and $X_{T_1}^{T_1+T_2}$,
respectively. The construction is such that the random processes $U_0^{T_1}$ and $U_{T_1}^{T_1+T_2}$ are independent of each other. Let
$X_0^{T_1+T_2}$ be a concatenation of $X_0^{T_1}$ and $X_{T_1}^{T_1+T_2}$. Applying Lemma 5 on the discrete-time process $\{(X_i, Y_i)\}_{i=1}^{n+m}$,
where $(X_i, Y_i) = (X_{t_i}^{t_{i+1}}, Y_{t_i}^{t_{i+1}})$ for $i = 1, 2, \ldots, n+m$, we obtain that for any fixed $\mathbf{t}_1$, $\mathbf{t}_2$, $X_0^{T_1}$, $X_{T_1}^{T_1+T_2}$, $U_0^{T_1}$,
and $U_{T_1}^{T_1+T_2}$ as described above, we have

$$ I_{\mathbf{t}}(X_0^{T_1+T_2} \to Y_0^{T_1+T_2}) \ge I_{\mathbf{t}_1}(X_0^{T_1} \to Y_0^{T_1}) + I_{\mathbf{t}_2}(X_{T_1}^{T_1+T_2} \to Y_{T_1}^{T_1+T_2}). \tag{155} $$

Note that the Markov condition $X_i \to (X_{n+1}^{i-1}, Y_{n+1}^{i-1}) \to (X^n, Y^n)$ indeed holds because of the construction of
$X_0^{T_1+T_2}$. Furthermore, because of the stationarity of the noise, (155) implies (147). Finally, Fekete's lemma
[42, Ch. 2.6] together with the superadditivity in (147) implies the existence of the limit in (148). ■

The proof of Theorem 5 consists of two parts: the proof of the converse, i.e., (145), and the proof of achievability,
i.e., (146).

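Proof of the converse for Theorem 5: Fix an encoding scheme $\{f_t\}_{t=0}^T$ with rate $R$ and probability of
decoding error $P_e^{(T)} = P\{M \ne \hat{M}(Y_0^T)\}$. In addition, fix a partition $\mathbf{t}$ of length $n$ such that $t_i - t_{i-1} < \Delta$ for
any $i \in [1, 2, \ldots, n]$ and let $t_n = T$. Consider

\begin{align}
RT &= H(M) \tag{156} \\
&= H(M) + H(M|Y_0^T) - H(M|Y_0^T) \tag{157} \\
&\le I(M; Y_0^T) + T\epsilon_T \tag{158} \\
&= I(M; Y_0^{t_1}, Y_{t_1}^{t_2}, \ldots, Y_{t_{n-1}}^{t_n}) + T\epsilon_T \tag{159} \\
&= \sum_{i=1}^n I(M; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}) + T\epsilon_T \tag{160} \\
&= \sum_{i=1}^n I(M, X_0^{t_{i-1}+\Delta}; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}) + T\epsilon_T \tag{161} \\
&= \sum_{i=1}^n I(M, X_0^{t_i}, X_{t_i}^{t_{i-1}+\Delta}; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}) + T\epsilon_T \tag{162} \\
&= \sum_{i=1}^n \Big[ I(M, X_0^{t_i}; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}) + I(X_{t_i}^{t_{i-1}+\Delta}; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}, M, X_0^{t_i}) \Big] + T\epsilon_T \tag{163} \\
&= \sum_{i=1}^n \Big[ I(X_0^{t_i}; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}) + I(X_{t_i}^{t_{i-1}+\Delta}; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}, M, X_0^{t_i}) \Big] + T\epsilon_T \tag{164} \\
&= \sum_{i=1}^n I(X_0^{t_i}; Y_{t_{i-1}}^{t_i} | Y_0^{t_{i-1}}) + T\epsilon_T \tag{165} \\
&= I_{\mathbf{t}}(X_0^T \to Y_0^T) + T\epsilon_T, \tag{166}
\end{align}

where the equality in (156) follows since the message is distributed uniformly, the inequality in (158) follows
from Fano's inequality, where $\epsilon_T = \frac{1}{T} + P_e^{(T)} R$, the equality in (161) follows from the fact that $X_0^{t_{i-1}+\Delta}$ is a
deterministic function of $M$ and $Y_0^{t_{i-1}}$, the equality in (162) follows from the assumption that $t_i - t_{i-1} < \Delta$, the
equality in (164) follows from (138), and the equality in (165) follows from (139). Hence, we obtained that for
every $\mathbf{t}$

$$ R \le \frac{1}{T} I_{\mathbf{t}}(X_0^T \to Y_0^T) + \epsilon_T. \tag{167} $$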

Since the number of codewords is finite, we may consider the input signal of the form $X_0^{T,\mathbf{t}}$ with $X_{t_{i-1}}^{t_i} =
f(U_0^T, Y_0^{t_i-\Delta})$, where the cardinality of $U_0^T$ is bounded, i.e., $|\mathcal{U}_0^T| < \infty$ for any $T$, independently of the partition $\mathbf{t}$.
Furthermore,

$$ R \le \inf_{\mathbf{t}} \frac{1}{T} I_{\mathbf{t}}(X_0^T \to Y_0^T) + \epsilon_T = \frac{1}{T} I(X_0^T \to Y_0^T) + \epsilon_T. \tag{168} $$

Finally, for any $R$ that is achievable there exists a sequence of codes such that $\lim_{T \to \infty} P_e^{(T)} = 0$, hence $\epsilon_T \to 0$
and we have established (145). ■
Note that as a byproduct of the sequence of equalities (158)-(166), we conclude that for the communication
system depicted in Fig. 3,

$$ I(M; Y_0^T) = \inf_{\mathbf{t} : t_i - t_{i-1} < \Delta} I_{\mathbf{t}}(X_0^T \to Y_0^T) = I(X_0^T \to Y_0^T). \tag{169} $$

The only assumptions that we used to prove (158)-(166) are that the encoder uses a strictly causal feedback of the
form given in (144) and that the channel satisfies the benign assumption given in (138). This might be a valuable
result by itself that provides good intuition for why directed information characterizes the capacity of a continuous-time
channel. Furthermore, the interpretations of the measure $I(M; Y_0^T)$, for instance, as given in [23], should also
hold for directed information and vice versa.

For the proof of achievability we will use the following result for discrete-time channels.

Lemma 6. Consider the discrete-time channel, where the input $U_i$ at time $i$ has a finite alphabet, i.e., $|\mathcal{U}| < \infty$,
and the output $Y_i$ at time $i$ has an arbitrary alphabet $\mathcal{Y}$. We assume that the relation between the input and the
output is given by

$$ Y_i = g(U_i, Z_i), \tag{170} $$

where the noise process $\{Z_i\}_{i \ge 1}$ is stationary and ergodic with an arbitrary alphabet $\mathcal{Z}$. Then, any rate $R$ is
achievable for this channel if

$$ R < \max_{p(u)} I(U;Y). \tag{171} $$

Proof: Fix the pmf $p(u)$ that attains the maximum in (171). Since $I(U;Y)$ can be approximated arbitrarily
closely by a finite partition of $\mathcal{Y}$ [16], assume without loss of generality that $\mathcal{Y}$ is finite. The proof uses the random
codebook generation and joint typicality decoding in [43, Lecture 3]. Randomly and independently generate $2^{nR}$
codewords $u^n(m)$, $m = 1, 2, \ldots, 2^{nR}$, each according to $\prod_{i=1}^n p_U(u_i)$. The decoder finds the unique $\hat{m}$ such that
$(u^n(\hat{m}), y^n)$ is jointly typical. (For the definition of joint typicality, refer to [43, Lecture 2].) Now, assuming that
$M = 1$ is sent, the decoder makes an error only if $(U^n(1), Y^n)$ is not typical or $(U^n(m), Y^n)$ is typical for some
$m \ne 1$. By the packing lemma ([43, Lecture 3]), the probability of the second event tends to zero as $n \to \infty$ if
$R < I(U;Y)$. To bound the probability of the first event, recall from [44, Thm 10.3.1] that if $\{U_i\}$ is i.i.d. and
$\{Z_i\}$ is stationary ergodic, independent of $\{U_i\}$, then the pair $\{(U_i, Z_i)\}$ is jointly stationary ergodic. Consequently,
from the definition of the channel in (170), $\{(U_i, Y_i)\}$ is jointly stationary ergodic. Thus, by Birkhoff's ergodic
theorem, the probability that $(U^n(1), Y^n)$ is not typical tends to zero as $n \to \infty$. Therefore, any rate $R < I(U;Y)$
is achievable. ■
The proof of achievability is based on the lemma above and the definition of directed information for continuous
time. It is essential to divide the time into small intervals as well as to increase the feedback delay by a small but positive
value $\delta > 0$.

Proof of achievability for Theorem 5: Let $\Delta' = \Delta + \delta$, where $\delta > 0$. In addition, let $\mathbf{t} = (0 = t_0, t_1, \ldots, t_n = T)$
be such that $t_i - t_{i-1} < \delta$ for all $i = 1, 2, \ldots, n$. Let $X_0^{T,\mathbf{t}}$ be of the form

$$ X_{t_{i-1}}^{t_i} = \begin{cases} f(U_0^T, Y_0^{t_i - \Delta'}), & t_i \ge \Delta', \\ f(U_0^T), & t_i < \Delta', \end{cases} \tag{172} $$

where the cardinality of $U_0^T$ is bounded. Then we show that any rate

$$ R < \frac{1}{T} I_{\mathbf{t}}(X_0^{T,\mathbf{t}} \to Y_0^T) \tag{173} $$

is achievable.

Assume that the communication is over the time interval $[0, nT]$, where $T$ is fixed and $n$ may be chosen to
be as large as needed. Partition the time interval $[0, nT]$ into $n$ subintervals of length $T$ and in each subinterval
$[jT, jT+T)$, which we index by $j$, fix the relation

$$ X_{jT+t_{i-1}}^{jT+t_i} = \begin{cases} f(U_{jT}^{jT+T}, Y_{jT}^{jT+t_i-\Delta'}), & t_i \ge \Delta', \\ f(U_{jT}^{jT+T}), & t_i < \Delta'. \end{cases} \tag{174} $$

Note that this coding scheme is possible with feedback delay $\Delta$ since $t_{i-1} - \Delta > t_i - \Delta'$. This follows from
the assumption that $t_i - t_{i-1} < \delta$ and $\Delta' - \Delta \ge \delta$. Now, let us define a discrete-time channel where the input
at time $j+1$ is $U_{j+1} = U_{jT}^{jT+T}$ (which has a finite alphabet), the output at time $j+1$ is the vector
$Y_{j+1} = (Y_{jT}^{jT+t_1}, \ldots, Y_{jT+t_{i-1}}^{jT+t_i}, \ldots, Y_{jT+t_{n-1}}^{jT+T})$, and the noise at time $j+1$ is $Z_{j+1} = Z_{jT}^{jT+T}$. Note that since
$Z_{jT}^{jT+T}$ is stationary and block-ergodic, the noise process $\{Z_{j+1}\}_{j \ge 0}$ is stationary and ergodic. Furthermore, the
relation $Y_{j+1} = f(U_{j+1}, Z_{j+1})$ holds and the alphabet of $U_{j+1}$ is finite. Hence by Lemma 6 any rate

$$ R < \max_{p(u)} I(U;Y) \tag{175} $$

is achievable. Now using the definition of the discrete-time channel and the properties of directed information, we
obtain

\begin{align}
I(U;Y) &= I(U_0^T; Y_0^T) \tag{176} \\
&= I(U_0^T; Y_0^{t_1}, Y_{t_1}^{t_2}, \ldots, Y_{t_{n-1}}^{T}) \tag{177} \\
&= I_{\mathbf{t}}(X_0^{T,\mathbf{t}} \to Y_0^T), \tag{178}
\end{align}

where the equality in (176) follows from the definition of the discrete-time channel and the equality in (178) follows
from the same sequence of equalities as in (158)-(166). Since (178) holds for any $\mathbf{t}$ such that $t_i - t_{i-1} < \delta$ we
conclude that

$$ C(\Delta) \ge \frac{1}{T} \inf_{\mathbf{t}} I_{\mathbf{t}}(X_0^T \to Y_0^T). \tag{179} $$

Finally, by the definition of directed information and by the fact that (179) holds for any $T$, we have established
(146). ■

VII. Concluding Remarks 

We have introduced and developed a notion of directed information between continuous-time stochastic processes. 
It emerges naturally in the characterization of the fundamental limit on reliable communication for a wide class of 
continuous-time channels with feedback, quite analogously to the discrete-time setting. It also arises in estimation 
theoretic relations as the replacement for mutual information when extending the scope to the presence of feedback. 
In particular, with continuous-time directed information replacing mutual information, Duncan's theorem generalizes 
to estimation problems in which the evolution of the target signal is affected by the past channel noise. An analogous 
relationship based on the directed information holds for the Poisson channel. We have illustrated the use of the 
latter in an explicit computation of the directed information rate between the input and output of a Poisson channel 
where the input intensity changes only when there is an event at the channel output. One direction for future 
exploration is to use the "multiletter" characterization of capacity developed here to compute or approximate the 
feedback capacity of interesting continuous-time channels. 